transcript of episode 40: OPEN SCIENCE & ITS ENEMIES, PART II, 31st July 2024
[🎶 INTRO: "Spring Swing" by Dee Yan-Kee 🎶]
welcome to the error bar: where a clear plan for bold action leads to stability of change
in this episode i continue discussing the failure of some parts of the open science movement to deliver change for the people. i focus on populist podcasters
here is the brain news on the 31st July 2024:
Part II: The populists
prologue
in this episode, I present the second part of my three-part critique of some aspects of the open science movement. Part I was motivated by publication of the latest example of the poor, but popular open science tactic of p-circling – the post-hoc selection of p-values in other people's papers that is then used to justify harsh, unfair & unprincipled critique of the work. in Part II, i broaden the scope of, & the evidence for, my argument that the worst enemies of open science often come from within the open science movement itself; they wear its badges & they run its populist propaganda machines.
deviating from the error bar's usual format, this is a very long episode. it's not much longer than many other science-based podcasts, & the error bar releases episodes far less frequently & more sporadically than the market leaders. with this half-apology for its length, let's do the 40th episode of the error bar.
open science and its enemies. Part II: the populists
at the end of Part I, i introduced the second part of my essay rant against some parts of the open science movement by saying that i would focus on the influence of podcasters. shortly after releasing the episode i realised that that was not the central topic of Part II. criticising podcasts would be like criticising stone, parchment or the electromagnetic spectrum – they're just a medium. instead, i realised that Part II was about scientific populists, about researchers making populist appeals promoting simplistic solutions to the problems of science. i'll start this episode with an overview of what i mean by scientific populism. then i'll provide examples of how this populism manifests itself in various podcasts & other outputs of scientific reformists. then i'll broaden the argument to cover a range of heuristics put forward to solve the problems of science. the conclusion will be that the problems of science are hard, that they have no easy solution, but that there are scientists working on them. we'll hear more about those scientists in Part III.
populism
i follow politics closely, & the success of centre-right populist ideas often comes up on the centre-left podcasts that i listen to. for this episode i will jump on this bandwagon & shoe-horn this term of abuse to frame my critique of – some parts of – the open science movement. if you're a political scholar & want to compare this podcast with the rise of post-truth, polarisation & populism in the west, then email talk at the errorbar.com.
i looked up populism on wikipedia & cherry-picked this very limited definition: "in popular discourse, populism is sometimes used in a negative sense in reference to politics which involves promoting extremely simple solutions to complex problems in a highly emotional manner."
this may annoy the one political scholar listening, but for the argument in this episode i assume that the populists are a group of open science protagonists who seek extremely simple – i will argue simplistic – solutions to the complex problems of science. they identify a single clearly-defined problem, they may divide scientists into groups, rate or reward their behaviour with badges or rankings & at times they get emotional. when you hear their pleas for reform, you as a scientist must choose: are you a part of the problem or a part of their solution?
the scientific populists propose solutions to scientific problems that are simple enough to be printed on bright red hats & sold online or at conferences to Make Academia Great Again. the merchandise may be black & white, but the problems are anything but.
the scientific populists' political slogans include: "abandon null hypothesis significance testing", "use confidence intervals", "become a Bayesian", "report p-rep instead of p", "science before 2011 is suspicious", "judge a scientist not by the content of their character but by the shape of their p-curve", "redefine statistical significance", "replicate, replicate", "all studies must have N greater than 30".
i've covered some of these claims in the error bar before – especially episodes 27, 28 & 31. in the last section of this episode, i'll address a number of these populist appeals. for each one i'll say why – on its own – it does not provide a new direction for science to take. i'm not really against any one of these reforms, & as part of a range of reforms, each one may well provide a small & important course correction. but on their own they are a long way from the solutions that scientific populists claim they are.
scientific populists are bad for science in a similar way to how political populists are bad for politics. while we, the scientists, may see the problems of science clearly, the existing solutions may seem obscure, blurred & out of our reach. if only a strong, bold leader would emerge to shine a light on the path ahead, we could make quick progress. never mind that the path eventually goes nowhere, or goes off a cliff, or into a swamp – that will be somebody else's problem. the populist will soon leave, taking their next job in industry, banking, or the media. as the political satirist Jonathan Pie put it: "that's how populism works: it promises the moon & instead it hands you a DVD copy of 'Apollo 13'".
podcasts
the least observant amongst you will know that i run my own fact-checking brain science podcast, the error bar. given that, in this 40th episode i turn a critical eye to the errors of some other podcasters – i am calling them populists – i run the risk of being a massive hypocrite. please call out my hypocrisy by emailing talk at the errorbar.com.
i listen to podcasts for perhaps an hour every day. most i listen to obsessively from start to finish so that i don't miss anything. over the last year or so, i have been binge-listening to The Black Goat podcast – a podcast about doing science. as a whole, The Black Goat's 86 episodes are a top, top, top, world-class listen. i thoroughly recommend them to anyone – especially more junior researchers. the three Goat hosts are fun to listen to, well-informed, serious about science & they get along like great friends. if you've got an 86 hour train-ride coming up, start listening. i was dreading the end of [their] podcast series as i realised what a joy it had been. i will return to criticise some of its populist appeals later.
The Studies Show
one science podcast that i could not get past the first few episodes of was The Studies Show by Dr Stuart Ritchie & Tom Chivers. Ritchie was an academic psychologist at Edinburgh & King's College London in the UK, & wrote a book called Science Fictions about the replication crisis. Chivers is a writer & journalist. together, their podcast does much of what the error bar tries to do, but they do it with thousands more people listening & covering a much wider range of popular science topics on which they proclaim some expertise, every week.
i stopped listening – & started planning this three-part critique – when Ritchie said on episode 10 of their podcast that "EEG does produce very noisy data". in the previous episode [Part I of my mini-series] we heard a Professor of statistics & a self-declared data thug loudly describing a recently-published dataset as "noisy" apparently on the basis that they did not like the results. something in my head clicked when i heard Ritchie say that 'EEG is very noisy' – it was a moment of realisation for me – so i will spend some time describing that moment & what i think it means.
to say that one of the most-commonly & long-used techniques in human neuroscience is "noisy" betrays a deep ignorance & reveals the populist streak that i am attacking. electroencephalography – or EEG – relies on recording electrical signals from the brain. we've been recording these signals for 150 years & there's no question that they both exist & are extremely valuable in medicine & science. instead of opening up someone's head or body to record directly from the nervous system, you can place electrodes on the skin over a nerve, over the heart, over the spine, or anywhere on the head. with the right electrode placement, the right amplifier settings, the right stimulus & sufficient repetitions, you can record aspects of the working nervous system that are unavailable with any other method. i recorded my first spinal potential earlier this year, with a single ~5 minute recording from the back of my neck.
EEG is amazing. given that we are recording electrical activity several centimetres away from its neural generators, it's remarkable we get any useful signal at all. to make any sense of Ritchie's statement that 'EEG is very noisy', we need to be generous to him. let's assume that he was expressing his personal frustration that EEG data generally does not answer the cognitive psychological questions that are of interest to him. or perhaps his own EEG experiments didn't work out the way he wanted. in this characterisation of his view, then, EEG is a 'noisy technique', & any study presenting EEG data can therefore be ignored. i understand this frustration: there's too much science out there, we can't appraise it all or be an expert in it all, & much of it is irrelevant to our interests. but if you run a general science podcast where you present yourself as an expert, you need to make a judgement & a choice: do you claim that an entire field of neuroscience is "noisy" or do you just say that you are not an expert? populists make one choice, scientists another.
in writing this episode, i returned to episode 10 of Ritchie & Chivers' podcast to listen again. to repeat, listening to this episode made me unsubscribe from The Studies Show. the episode is about whether direct cash transfers to people in need produce more benefit than other, more organised forms of aid delivered by governmental & non-governmental organisations. Ritchie introduces – & criticises – three papers, the first of which i will now review.
the first paper, published in Nature, was a meta-analysis of a wide range of economic interventions across low- & middle-income countries. in the intervention, cash was transferred to people. the control conditions were comparable periods without cash transfers. the outcome variable was deaths per person year. the dataset is enormous: 37 countries over 20 years; 4.3 million adults & 2.9 million children in the study sample; 48.6 million person years of data analysed. 20% of the data came from intervention phases, 80% from control periods.
the [study] authors assessed the effect of cash transfers on mortality. it's a very simple test: did fewer people die after receiving cash? the answer is a categorical "yes": cash decreases mortality. putting all the data together, there were 0.41% deaths per person year after a cash transfer, compared to 0.67% deaths per person year without the intervention. that is an astonishing 38% decrease in death. the sample size is 37,190 deaths following cash & 252,012 deaths during the control periods. if you scale up the deaths after cash transfers so that the total person years of the intervention sample matches that of the control sample, then we have an estimated 155,000 deaths after cash, compared to 252,012 deaths after no cash. this is an astonishingly large effect that barely requires any statistical analysis.
to the authors' credit, they did do statistics. they performed some "multivariable modified Poisson regression models". these methods i do not understand & am not an expert in, but i can guess that they are a form of generalized linear model in which the Poisson distribution is used in place of the normal distribution. Poisson distributions are useful where the data are discrete counts of events that occur with a particular probability over a particular period of time. in short, they sound like exactly what is needed to analyse numbers of deaths per unit time. mortality was analysed separately for men & women, & separately for children aged under 5, 5 to 9 & 10 to 14 years of age. the authors reported evidence for a significant effect of cash transfers on the mortality of women & of children under five years. i checked the numbers myself by adding them up & doing some very simple – & likely wrong – chi-square & binomial tests. the p-values had between 22 & 109 leading zeros in them – very, very small p-values; very very strong statistical effects.
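[editorial note: for anyone who wants to repeat my crude sanity-check, here is a sketch in python. it is emphatically not the authors' multivariable modified Poisson regression – just a rough comparison of deaths against person years, using the approximate figures quoted above, & it is likely wrong in the same ways my own chi-square & binomial tests were:]

```python
# a rough sanity-check in the spirit described above: NOT the authors'
# multivariable modified Poisson regression, just a crude comparison of
# deaths vs person-years using the approximate figures quoted in the episode.
from scipy.stats import chi2_contingency

person_years_total = 48_600_000          # 48.6 million person-years in total
py_cash = 0.20 * person_years_total      # ~20% of data from intervention phases
py_control = 0.80 * person_years_total   # ~80% from control periods

deaths_cash, deaths_control = 37_190, 252_012

# crude death rates per person-year (roughly 0.4% vs 0.7%, as quoted above)
print(deaths_cash / py_cash, deaths_control / py_control)

# scale the intervention deaths up to the control sample's person-years
# (roughly 150,000 with these rounded figures; the paper's exact person-years
# give the ~155,000 quoted above)
print(deaths_cash * py_control / py_cash)

# a crude chi-square test on a 2x2 table of deaths vs surviving person-years
table = [[deaths_cash, py_cash - deaths_cash],
         [deaths_control, py_control - deaths_control]]
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)   # p is vanishingly small (it may underflow to 0.0 at this scale)
```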
on The Studies Show, this paper was described & reviewed as follows. i'm going to quote this at length because it's really quite astounding in how it goes from evidence to evaluation. i've added some selection & emphasis, so i'd encourage you to listen to the original as well:
RITCHIE: "Nature, big deal journal... whether cash transfers work to make people live longer... it's a lot of noise, a lot of noise that can be introduced there, which i think makes the results a bit less reliable anyway... the problem is this: what jumps out at you when i read this to you, Tom? 'We found that cash transfer programmes were associated with a 20% reduced risk of death in adult women, & an 8% reduced risk in children aged younger than 5 years old.'"
CHIVERS: "ok, so my immediate question is did you have a reason to think that you'd see it in women but not men, & in under five year olds... rather than others... i mean maybe it's real, but i'd like to know that you'd specified these outcomes in advance."
RITCHIE: "yeah, that's my problem. is this screams subgroup analysis, like, they didn't necessarily find the result in the whole sample, so they went into the subgroups, splitting it up by sex, splitting it up by age, but only in children & not in adults. i'm not sure what the story is there, i suppose you could give a rationale for it. but the fundamental problem is they don't, they didn't preregister their study... so we don't know what their pre-existing rationale before they started the analysis was. did they say 'well we think this is only going to happen in women & kids under five'? or did that, just kind of occur, as the analysis was done... if you didn't really have an idea & just kind of blunder in to the analysis & just kind of do things & see what's significant, that can often imply that there might have been some bias there in the analysis, you might have unconsciously pushed the results to try to find a particular result, or maybe just something random happened & you don't necessarily know if that's going to replicate in any other sample that you then collect in the future."
[Chivers then gives a coin-tossing example, comparing the analysis of deaths in women versus men to tossing lots of 10 pence & 5 pence pieces, & only reporting the results for the 10 pence coin.]
RITCHIE: "it's just a fluke, yeah. & it's a kind of p-hacking, this idea that you're trying to get the statistically significant p-value by hacking the results, & there are different ways of doing that, one is running your analysis lots & lots of different times, & one is dredging through a big dataset until you find something that's significant. & the fundamental thing is we know that if you run the analysis lots & lots of times you are more likely to find a false positive result, that is one that exists in your dataset, but not in any other dataset... they do lots of other analyses... & they describe those as exploratory... which is really weird... why should we consider those ones exploratory & not the other ones, because you didn't preregister it, you didn't tell us which ones you were going to plan to do, beforehand. why should we consider those ones exploratory & not the whole paper to be an exploratory analysis. & why didn't you just correct them for multiple comparisons, because you can do that... "
listening to their discussion, it is hard to believe that either of these journalists has opened the freely-available paper to read it in any detail. every one of Ritchie's critical questions was answered in the paper. he complains that the paper "screams" of subgroup analysis, as if the authors were trying to hide it. no screaming is required – it is written in the paper multiple times. Ritchie 'supposes that the authors could give a rationale for the subgroup analyses'. no supposition is required – the rationale is clearly given in the paper: female heads of households aged 15-49 years were the primary respondents in all survey data; previous studies stratified the data by sex; mortality data is typically given by sex & in age brackets including, for children, under 1, under 5, & over 5 years. the authors referred to the relevant UN all-cause mortality data. Ritchie speculates that maybe there was no effect in the whole sample, & that's why the data were stratified. again, the answer is in the paper: every reported effect was in the same, predicted direction of lower mortality after cash transfers. the confidence interval for the male data just includes zero, so adding the male to the female data – pooling the means & the variances – could only produce an overall significant effect. it is to the [study] authors' credit that they analysed the data using a more complex form of analysis than i am able to do. in short: there is no doubt at all in my mind that 252,012 deaths is a larger number than 155,000 deaths. the pattern of less death after cash is shown in every subgroup analysed.
what can explain Ritchie's lazy, incoherent rambling about "a lot of noise", accusing the authors of "blundering in" to their analysis, "unconsciously push[ing] the results to find a particular result", claiming about this paper that "it's just a fluke... a kind of p-hacking", "running your analysis lots & lots of different times... dredging through a big dataset until you find something that's significant"? i found myself really scratching my head here. what explains this attitude? all i can guess is that Ritchie read the abstract & saw that the effect was found in women & young children but not in other subgroups. by that time he had already written the headline for his story, so he put the paper down, didn't check the data, didn't look at Figure 2, didn't read the detailed explanations in the text, didn't read the reviewers' reports, didn't check the cited literature, didn't open the UN mortality spreadsheet. i did all of those things – quite superficially – in about 15 minutes. Ritchie seems particularly upset that the study was not pre-registered. if it had been pre-registered, would he have read the pre-registration?
over 40 episodes of the error bar, & over three & a half years, i have drawn some unwarranted conclusions, i have made some mistakes, & i have made at least one person cry. i am sorry for these errors & i hope that i have dealt with them in follow-up episodes or in person – please email talk at the errorbar.com if that is not correct. but i can guarantee you, dear listener, that in every story i presented, i actually read the fucking paper that i was criticising. & if i did not read the fucking paper, i told you so & i limited my critique. reading the fucking paper is why episodes of the error bar focus on my limited areas of interest & expertise, last about 10 minutes, take a day each to produce & appear on average every 33 days. The Studies Show, by contrast, covers all science, lasts an hour & appears every 7 days. for every minute of the error bar, The Studies Show spews out about 28 minutes. so what do you want from your science podcasts, guys: quality or quantity?
it is not surprising that science journalists producing clickbait content for the mass media don't have time to read the paper & are not experts, but why can't they tell us that, rather than hallucinating pernicious motivations & accusing scientists – without evidence – of "noisy" data, of "blundering" & of "p-hacking"? on this, Ritchie has form. in my first interaction with Ritchie – on twitter – i asked readers to fact-check his criticism of a large meta-analysis of brain imaging studies. he had stated that:
"the first thing you've got to know about structural brain imaging studies is that there are endless variables you can analyse: volume, surface area, cortical thickness, dozens of different white matter properties... etc etc. so why'd they pick cortical thickness in particular?"
i suspected that Ritchie was over-generalising on this claim. he replied to my fact-check request, so i asked him specifically why a paper reporting the analysis of structural brain data from T1 scans – a common type of brain scan – would produce "dozens of white matter properties...". typically, white matter properties are derived from a different type of brain scan – diffusion-weighted scans – which are both less commonly done & not the focus of the target paper. Ritchie replied that the authors could have analysed these measures, so his critique was "perfectly fine". this is astonishing: apparently scientists must be held to account for all of the analyses that they could have done. aggravated by his response, i checked the target paper. of 101 studies included in the meta-analysis, only 25 included any 'white matter variables'. Ritchie, faced with this fact, withdrew that part of his critique, but continued to claim that the rest of it stood.
to summarise my attack on Ritchie's approach to scientific criticism: he doesn't read the papers, he mis-characterises the authors' motivations, he misreads their methods, he hallucinates variables & he accuses authors of p-hacking, before moving on to the next study that features on his weekly hour-long entertainment show. i have now fact-checked a total of two of Ritchie's critiques. both revealed his superficial reading of the work in question & his lazy analysis. this is superficial, populist, tabloid-quality science journalism.
noisy data
it's not just EEG that has been said to be noisy. over the years i've heard students, postdocs & professors all saying that "behaviour is noisy", "fMRI is noisy", "TMS is noisy". i challenge listeners to identify a single topic or tool which all researchers, at all times, would describe as "quiet". the complaint of "noisiness" tells us only about the scientist & not the science.
"noisy", then, is a cry for help. it expresses defeat, frustration, the sense that science is overwhelming, that we can't understand everything, that we can't be an expert in our lifetimes on more than a very thin slice of the science cake. to ease their suffering, these populists loudly project their ignorance into claims that "this method is noisy" or "that dataset is noisy". once made, this claim hides from view a whole field or method or dataset. it quietens the populist's mind so that they can move on to their next provocation.
populist scientists proclaim expertise & see noisy data everywhere. humble scientists must proclaim their ignorance, withhold comment & defer to experts. Rory Stewart – one of the more intellectual UK politicians – echoed this in his recent series on The Long History of Ignorance.
everything hurts
likely one of the most successful & influential podcasts about doing psychological science is Everything Hertz by Drs James Heathers & Dan Quintana, now on its 182nd episode. a week or so after Part I of my three-part series came out, i heard Quintana & Heathers discuss the preprint by Gelman & Brown which i had discussed in [Part I]. Quintana was a big fan of the commentary, saying it was: "a really interesting paper... i really liked this paper... this was a really cool paper."
they also praised the commentary for its criticism of the target paper – about the effect of perceived time on others' perception of healing – saying that these "unreplicable" papers provided "a very uncritical look at the prior literature" & that "when you just scratch at the surface a little bit you find that a lot of the premises are quite shaky".
it's not surprising that Quintana & Heathers seem unwilling to criticize a paper written by Dr Nick Brown – a friend & past guest of their own podcast. but they give no evidence that they've actually read the target paper about cupping. instead, they parrot Gelman & Brown's flawed, lazy critique. i will return to Quintana's point about what happens when you scratch the surface of these populist papers & podcasts towards the end of this essay. it's a clear case of charred, lazy pots calling the new kettle black.
The Black Goat
having dealt with the scientific verbal diarrhoea of The Studies Show, let's return to my favourite science podcast, The Black Goat. there is so much good content on the podcast that the critique i'm about to give might sound a bit rude. i don't mean it that way, but it serves my argument. after listening to all 86 episodes, i have three concerns.
the foundation myth
the first is about how the presenters refer to the replication crisis. the second & third – closely related – concerns are their endorsement of simple numerical metrics to judge the quality of science.
the first concern is with the foundation myth of the replication crisis. according to much of the replication crisis literature that i have read, the crisis was first uncovered in 2011 – refer back to Part I of this episode for details. according to this characterisation, before 2011, scientists were in relative ignorance about the quality of scientific methods, the role & the biases of the researcher in designing & interpreting experiments, the mis-use of statistics, or about changing hypotheses after the results have been seen.
this characterisation is obviously false – as i'm sure many proponents of crisis would agree. good, careful, cautious, unbiased science has always been done. yet alongside this good science there has always been some small proportion of frauds, lazy scientists & scientists so taken with their own hypotheses that they don't see their own errors or biases.
among the 86 good hours of Goaty content, there were about fifteen seconds of Black Goat bleating that i now identify as manifestations of scientific populism. The Black Goats said something like: 'do you judge pre-2011 papers differently to post-2011 papers?' & there was general agreement that yes, they do – or at least they did at the time.
the idea that work published before the so-called 'replication crisis' is somehow more suspicious than work published later is very problematic. so problematic that i can't really believe that The Goats really believe that. it was partly a throwaway line, but it is repeated several times on the podcast, for example at 1 hour & 1 minute into The Expertise of death, & i've heard it a couple of other times.
assuming that the proponents of crisis really do believe there is a cut-off, does it matter exactly when the work was conceived, or when it was done, or just when it was published? even if we allow some fuzziness over the exact cut-off date – a five-year window, say – the idea that pre-2011 science was worse than post-2011 science requires some further very strong assumptions: first that every scientist simultaneously became aware of the replication crisis & second that they simultaneously & immediately responded to it by changing the way they worked. this just isn't plausible.
several months ago i asked some close, quite senior science colleagues what they thought of the "replication crisis". they had not heard of it. we should be very wary of assuming that science is somehow different now than it was just 13 years ago. we should be even more wary that any differences are directly attributable to the replication crisis.
my complaint here is that open science reformists sometimes seem to believe that there was a date when the revolution began & that science before this date is therefore less trustworthy than science after this date. boiled down & popularised up, a single number – the year of publication – can now be used as an indicator of scientific quality. this is naïve at best & dangerous at worst. non-populists would do well to avoid making any such claim.
if you are told by a populist that the replication crisis began in 2011 & that all science before 2011 is therefore lower quality, you can give them two examples to show that it did not, that – even before 2011 – scientists cared about quality. the first is a paper by Kerr in 1998 in which the term 'HARKing' – 'Hypothesising After the Results are Known' – was gifted to the world. 13 years before the replication crisis! but my favourite example is from Theodore X Barber. in 1976 Barber published Pitfalls in Human Research: Ten Pivotal Points – a book which discusses many of the questionable research practices that we still discuss today, but in the context of a similar crisis of confidence in certain topics in 1960s social psychology. 35 years before the replication crisis! [i shall return to this book in a later episode].
p-curves
the second concern that i had with The Black Goats is similar in kind, but potentially more serious in consequence.
my understanding of what has followed from the replication crisis is that a small community of scientists has become more aware of many of the problems in science & that this community has made great efforts to find better ways of talking about it & doing it. this is all good news, at least for that community.
sometimes, however, the open science community has gone astray. just like in the absurd claim that science after 2011 is better than before, other single-number metrics have been popularised to diagnose & to treat poor scientific practice. the first is the p-value, the second the number of datapoints you have. p & N.
a p-value between 0 & 1 is typically used to estimate how likely a set of observations – or a more extreme set – would be, assuming that there was nothing really of interest in the data. for example, if you're comparing the heights of cats & dogs & you find that dogs are 10cm taller than cats, the p-value will tell you how likely it would be that your data would show a difference at least as large as this 10cm, if cats & dogs really were, on average, the same height. [spoiler alert: cats & dogs are not the same height, but this is not what the p-value is telling you.]
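[editorial note: here is a minimal simulation sketch of that cats-&-dogs example, in python. every number in it – the means, the spreads, the sample sizes – is invented for illustration:]

```python
# simulate a world in which cats & dogs really ARE the same average height,
# & count how often a difference of 10cm or more appears anyway. that
# proportion is (roughly) what the p-value for an observed 10cm difference
# is estimating. all numbers here are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
n_sims, n_per_group, true_mean, sd = 10_000, 20, 40.0, 8.0

count_big_diff = 0
for _ in range(n_sims):
    cats = rng.normal(true_mean, sd, n_per_group)
    dogs = rng.normal(true_mean, sd, n_per_group)
    if abs(dogs.mean() - cats.mean()) >= 10:    # a difference of 10cm or more
        count_big_diff += 1

print(count_big_diff / n_sims)   # small: big differences are rare under the null
```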
i find p-values very helpful, & i even think i understand them after 31 years of doing statistics. in the years following the replication crisis of 2011, there has been a great focus on p-values. much too much focus, in my view. this could be in part because, on the whole, scientists don't understand p-values. three times i've asked groups of 10 or more PhD & more senior scientists to select the one correct definition of p from four options. to date, no group has got it right, on average. but the reformists' focus on p-values is also in part because the p-value has been seen as a target – a goal that needs to be hit to allow your scientific work to be published. as long as p is less than 5%, you have scored a goal. i agree that p has been & is still a target for science; but i disagree that it is anything like as problematic as many of the crisis scientists claim.
the claim of the crisis scientists is that, because p is a target, scientists will therefore adapt their behaviour to meet that target. they will cut corners on their methods, remove outlying datapoints, stop collecting data when they score a p-goal & selectively report only those studies or datasets that hit the back of the net. all of that surely does happen; perhaps it happens too much; & it is frustrating that it still happens. so what can we do about it?
one approach is to look at the p-values that scientists report in their papers. i discussed some of the problems of this tactic in Part I, the p-circlers. but instead of picking out individual p-values to frame a critique, you can instead look at a representative body of work – a paper, a set of papers, or a whole scientist's output – & count how many of their p-values are, say, lower than 1%, between 1% & 5%, or higher than 5%. if there are 'too many' p-values in the 'uncanny valley' of between 1% & 5%, then the body of work is seen as suspicious. this process is called p-curve analysis; you can also do it with other statistics like the Z-score. & it drives me up the fucking wall.
the naïvety & simplicity of this approach – counting p-values – turns the complicated, messy, theoretical, social business of science & its evaluation by peers or experts, into yet another single-number metric with which to beat people up. a proponent of the p-curve method – Ulrich Schimmack – once wrote to me on twitter. he had calculated my Z-curve & presented me with my score. he said my scores were 'not bad,' but there was 'a curious concentration of Z-values around 3.1', & he wondered if i could explain why. i did not reply. another twitter user responded on my behalf, which i very much enjoyed: 'maybe he tests non-random hypotheses?'
my understanding of p-curve analysis is that, like p-values, it assumes that the null hypotheses that form this set of p-values are all true, & that the data being tested come from continuous, random, independent, normal distributions. any deviation from that pattern – particularly any concentration of p-values in the range between 1 and 5% – is taken as evidence of questionable research practices. it's fine to run simulations using these assumptions & to look at how simulated p-values compare with real data, but the alternative explanation for unexpected distributions of p does not have to be scientific misconduct. Professor Dorothy Bishop argued this point in 2016. scientists do not tend to spend their time randomly testing hypotheses on completely independent samples & reporting them only when p goes below 5%. they also may not generally measure perfectly-normal distributions, nor do they tend to collect perfectly-continuous data. continuous, random, normal distributions are excellent ways to model or to simulate theoretical data, but on their own they do not provide a justification to write to scientists on twitter to tell them that their p-curves look 'ok' or 'not ok'.
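[editorial note: to see where that 'uniform under the null' expectation comes from – & why a bump between 1% & 5% needn't imply misconduct – here is a small simulation sketch in python. the effect sizes & sample sizes are invented; this has nothing to do with anyone's real z-curve:]

```python
# under a true null, p-values are (roughly) uniformly distributed, so about 4%
# of them land between 1% & 5%. but a real, modestly-powered effect also piles
# p-values into that 1-5% 'uncanny valley' - no questionable practices needed.
# all parameters below are invented for illustration.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(2)
n_sims, n = 20_000, 30

def one_p(true_effect):
    x = rng.normal(true_effect, 1.0, n)          # one simulated study
    return ttest_1samp(x, 0.0).pvalue

for label, d in [("null true (d = 0.0)", 0.0), ("small real effect (d = 0.3)", 0.3)]:
    ps = np.array([one_p(d) for _ in range(n_sims)])
    print(label,
          "| p<1%:", round((ps < 0.01).mean(), 3),
          "| 1-5%:", round(((ps >= 0.01) & (ps < 0.05)).mean(), 3),
          "| p>5%:", round((ps >= 0.05).mean(), 3))
# null: roughly .01 / .04 / .95. small real effect: more p-values sit between
# 1% & 5% than below 1% - a 'bump' without any misconduct at all.
```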
back to The Black Goat, & another five seconds of dubious content, this time about assessing the quality of someone's work or the likelihood that they are doing bad science. i ran out of patience trying to find the specific quote, but The Goats said something like: 'we can look at people's p-curves' or 'we can look at how many p-values are between 1% & 5%' to get an idea about the quality of a scientist's work; the Goats referred to the Z-curve enthusiast Schimmack in a positive way, as if this populist approach to counting p-values might help fix the problems of science. it doesn't, it can't, it won't. bad Goats.
N
the third of my Goat-related gripes is similar, but this time with N, the number of independent bits of evidence in a dataset. again, the problem is the over-reliance on a single metric. again it is an occasional five-second comment on The Black Goat which got me thinking. & again i profoundly disagreed. the claim is that studies with 'small N' are less reliable than studies with 'large N'. while usually left vague by most scientists, occasionally the terms 'small' & 'large' are actually defined, as they were in the particular episode that made me write this part of my critique. in episode 69, The Last Straw, one of the hosts – Dr Alexa Tullett – was discussing how to get the right balance of expertise when reviewing other scientists' work. Tullett said:
"for a while I was reviewing... a fair number of EEG papers... &... sometimes people would complain that I didn't acknowledge... the norms for Ns in EEG papers... my evaluation of those papers... added something to somebody who was just going to say... 'it's normal for an EEG paper to have... 30 people, so that's fine' because it's actually not fine. & i think that i'm right about that."
on Tullett's view, 30 subjects is not a sufficient sample size for an EEG experiment. it's interesting that EEG comes up again here, & is singled out for special treatment. why this one method recurs in these critiques tells us more about the scientists & their research topics than about EEG. my conclusion is that EEG is just not the right tool to answer these scientists' questions. further, i suspect that just increasing the number of EEG datasets they collect will not help them very much either. telling a generation of scientists on your podcast that N needs to be above 30 in every EEG experiment is very poor advice. listeners to episode #37 of the error bar will know that any such target for N is meaningless on its own.
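[editorial note: for anyone who wants the arithmetic behind 'any target for N is meaningless on its own', here is a sketch in python. the effect sizes are invented for illustration & the power analysis is the simplest one available, not a claim about any particular EEG study:]

```python
# the sample size you need depends on the effect size you are chasing (& on
# alpha & the power you want) - so a blanket 'N must exceed 30' rule can be
# wildly wrong in both directions. effect sizes below are invented examples.
from statsmodels.stats.power import TTestPower

analysis = TTestPower()          # one-sample / within-subject t-test power
for d in (2.0, 0.8, 0.5, 0.2):
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.8,
                             alternative='two-sided')
    print(f"d = {d}: ~{n:.0f} participants for 80% power")
# a huge within-subject effect (like a clean evoked potential) needs a handful
# of participants; a small effect (d ~ 0.2) needs a couple of hundred.
```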
heuristics
well, that deals with the fifteen seconds of dubious content among The Black Goat's 86 hours of quality broadcasting.
i've spent a lot of time discussing The Studies Show & The Black Goat's populist outbursts here because they came to me at about the same time, they both involved criticism of EEG & they were both profoundly wrong. they were wrong for interesting reasons, which i shall now develop.
the populist's appeal to reform science starts with an easy, but weak, criticism, one that most of their constituency might tend to agree with, or at least not be hostile to. in Ritchie's criticism of EEG we have someone who – as far as i can tell – is not an EEG expert & has not published EEG papers, telling his audience of more than 5,000 that EEG is noisy, & that its products can therefore be ignored. in Tullett's claim that sample sizes of 30 are not sufficient for EEG research, we have someone who is an expert in EEG & has published papers using this method, but who is extrapolating what may be sensible advice for this one very small research field to apply to an extremely large range of research fields. EEG studies of higher-level cognitive phenomena of the type that Tullett has worked on – nostalgia, performance-monitoring & empathy – may well require larger samples than are typically collected in other fields. but there is no cause at all to extend this sample size estimate to any other use of the same general method. evoked potentials can be measured with a few seconds' or minutes' data collection in a single subject.
[Editorial comment: i am currently learning how to collect EEG data. i only really need the data at the moment to find out exactly when a signal from the hand reaches the brain. the following figure shows data collected in about 15 minutes from a single healthy adult participant. we stimulated the median nerve at the wrist using a 9.4mA electrical pulse, & repeated the stimulation 665 times, once every 1.6 seconds. because i am terrible at this, & especially because i made my own electrodes by salvaging wires from a bin, the electrical signals were very 'noisy'. but, because i know that the signal i am looking for is about 2-4 uV in amplitude & peaks at about 20ms after the stimulus (the 'N20 potential'), i can focus in on that part of the data, & exclude any trials where there was a lot of 'noise' (>25uV peak-to-peak) within the critical 10-40ms window. obviously i need to get better equipment & learn how to do this properly. however, given that my research question is: "when is the peak of the blue wave (which i expect between 15 & 25ms in the below graph)?", EEG is an extremely "quiet" method! (by contrast: if my research question is: "how does the N20 potential latency relate to the feeling of empathy while watching football penalty-taking?", i imagine that EEG could be described as: "quite noisy")
in this image, the stimulus was applied at 0ms on the wrist. at ~15ms there is a noticeable increase up to ~18ms, then a decrease to ~23ms, then the signal returns to the baseline level at ~40ms. dark blue = mean, light blue = ±95% confidence interval.]
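[editorial note: for the curious, here is a sketch in python of the averaging & artifact rejection just described. the 'epochs' array is hypothetical – one row per median-nerve stimulus – & the sampling rate is assumed; the thresholds & windows follow the text above:]

```python
# sketch of the averaging & peak-to-peak artifact rejection described above.
# 'epochs_uv' is a hypothetical array of shape (n_trials, n_samples) in
# microvolts - not the actual recording; the sampling rate is assumed.
import numpy as np

fs = 1000                                    # assumed sampling rate: 1000 Hz
times = np.arange(-0.1, 0.2, 1 / fs)         # -100ms to +200ms around the stimulus

def average_n20(epochs_uv):
    """average epochs, dropping trials with >25 uV peak-to-peak noise at 10-40ms."""
    win = (times >= 0.010) & (times <= 0.040)           # the critical window
    ptp = epochs_uv[:, win].max(axis=1) - epochs_uv[:, win].min(axis=1)
    clean = epochs_uv[ptp <= 25.0]                      # reject noisy trials
    mean = clean.mean(axis=0)                           # the evoked potential
    ci95 = 1.96 * clean.std(axis=0, ddof=1) / np.sqrt(len(clean))
    # latency of the negative-going N20: the minimum between 15 & 25ms
    n20_win = (times >= 0.015) & (times <= 0.025)
    n20_latency = times[n20_win][np.argmin(mean[n20_win])]
    return mean, ci95, n20_latency, len(clean)

# usage, with made-up noise standing in for the 665 real trials:
epochs = np.random.default_rng(3).normal(0.0, 5.0, (665, len(times)))
mean, ci95, latency, kept = average_n20(epochs)
print(f"kept {kept} of 665 trials; 'N20' latency ~{latency * 1000:.0f} ms")
```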
dismissing all EEG studies, or requiring that all EEG studies exceed a single numerical criterion, are populist pleas born in response to the everyday difficulties of science. they are personal & limited statements of contingent, contextual belief & they do not require a wider response.
these populist pleas for scientific reform are not limited to the method of EEG or to the two podcasts that i happened to be listening to last year. to finish this rant, i'll take you on a quick tour of several decades' worth of populist, heuristic appeals for scientific reform.
thou shalt replicate!
the current framing of the problems in science as a "replication crisis" is a populist appeal that the solution to our scientific problems is to value replication above all else. what matters is what replicates. but bad science can also replicate, & replicating bad science does not make it good. replication alone is not the goal. in many cases, as Professor Dorothy Bishop pleads, more research is not required.
thou shalt redefine statistical significance!
in 2018, a self-proclaimed "critical mass" of scientists made the populist claim that changing a single numerical criterion – alpha – from 0.05 to 0.005 was a necessary first step in improving the evidential basis of all science. several papers – some with an even more critical mass – responded that the move was not a sensible one. these "alpha wars" generated a lot of citations & media coverage for the authors on either side, myself included. but i can't remember reviewing or reading a single paper since those wars which uses a redefined level of statistical significance, justifies its alpha, or abandons statistical significance. it seems that the alpha war of 2018 had no consequence.
thou shalt not study para-psychology!
in Part I, I discussed Gelman & Brown's populist response to the article on cupping, healing & time perception by Aungle & Langer (2023). to remind you: Gelman & Brown didn’t read the paper, ignored evidence against their presuppositions, mis-characterised the results of the study, inconsistently interpreted effect sizes & accused the target authors of p-hacking. i concluded that Gelman & Brown did this because they did not like the results – that people's perception of time passing might change how quickly a mild bruise might be perceived to heal.
my suspicion is that studies like this attract attention & are selectively attacked because they deal with research topics on the border between the psychological & the para-psychological or the super-natural. populist crusaders like Gelman & Brown were radicalised in the crucible of the post-2011 replication crisis, which began – arguably – with a paper on predicting the future by Professor Daryl Bem. with scientific virtue on their side, the knights of scientific populism – and they are all men, it seems – ride out on their white horses, slaying the papers of non-believers who dare to cross their path by working beyond what they deem acceptable science. [adopts pompous Arthurian voice] 'Sir, you have published a paper on potential psychogenic effects on wound healing without a fully-worked out neurophysiological mechanism. watch – infidel! – as i run through your work with my pointy blade of steel.' this is Monty Python-level satire.
thou shalt report p-rep!
in 2006, an entire journal took a populist turn by requiring all authors not to report p-values any more, but instead to report a number that takes the p-value, divides it by 1 minus p, raises it to the two-thirds power, adds one, then takes the inverse. this new value – supposedly completely different from & yet monotonically, perfectly, & directly related to p – is known as p-rep: the 'probability of replication'.
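[editorial note: for the record, here is that recipe written out in python – smaller p in, bigger p-rep out, & nothing else:]

```python
# the recipe described above: take p, divide it by (1 - p), raise that to the
# two-thirds power, add one, then take the inverse. the result is a perfect
# one-to-one, monotonic transformation of p itself.
def p_rep(p):
    return 1.0 / (1.0 + (p / (1.0 - p)) ** (2.0 / 3.0))

for p in (0.05, 0.01, 0.001):
    print(p, round(p_rep(p), 3))   # smaller p in, bigger p-rep out
```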
thou shalt use confidence intervals!
the catastrophe of p-rep was part of a longer-term effort to reform statistical practice pioneered by Professor Geoff Cumming & colleagues. since the start of the twenty-first century, Cumming has studied scientists' willingness & ability to report & understand confidence intervals in place of p-values. Cumming encourages us to describe results in terms of estimation & plausible effect-sizes rather than relying only on, or defaulting to, the p-value. this sounds like a good idea: to improve statistical teaching & to understand why scientists don't understand statistics. but after a generation of scientists living & working under the regime of 'The New Statistics' & after most of several careers spent on this crusade, can we say, collectively, that we as scientists, or the institutions we work for, or the journals we publish in, have any better understanding of statistics? not really.
on Monday i had the distinct & grating displeasure of watching one of Geoff Cumming's youtube videos about how confidence intervals are great, but p-values are terrible. i'd been provided with a link in a 'response to reviewers' relating to an otherwise good manuscript that i'm reviewing. in the first version of the reviewed paper, the authors had stated that linear mixed models compare experimental conditions using confidence intervals around estimates, & that this is in sharp contrast to traditional p-value tests, which only tell us about the null hypothesis. they cited several of Cumming's papers in support of this unusual claim. baffled, i pointed out in my review that LMMs are not special or fundamentally different from other forms of the general linear model – such as t-tests or ANOVAs – & that confidence intervals & p-values are derived from the same data & assumptions. a pretty straightforward, vanilla comment, i thought. in the next version of the manuscript, the authors' response was to disagree: 'p-values tell us nothing, while confidence intervals tell us about the true population mean'. & if i didn't believe them, here was a link to a dodgy – & quite wrong – introductory statistics video from Professor Cumming. this made me annoyed, to say the least. listeners: if you don't want a one-thousand-word essay on p-values & confidence intervals back from your reviewer, don't send them links to inane, populist, anti-statistical clickbait in place of engaging with the actual serious point being made. that tip is for free. in my review i suggested that the authors were taking the position of a 'statistical extremist'.
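[editorial note: here, in python & with made-up data, is the one-thousand-word essay in miniature – the p-value & the 95% confidence interval come from the same mean, the same standard error & the same t-distribution, so the interval excludes zero exactly when p drops below 5%:]

```python
# sketch with invented data: the p-value & the 95% confidence interval are two
# views of one analysis - same mean, same standard error, same t-distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
diff = rng.normal(0.4, 1.0, 25)        # hypothetical paired condition differences

n = len(diff)
mean = diff.mean()
sem = diff.std(ddof=1) / np.sqrt(n)
t_stat = mean / sem
p = 2 * stats.t.sf(abs(t_stat), df=n - 1)                    # two-tailed p
ci = mean + np.array([-1, 1]) * stats.t.ppf(0.975, df=n - 1) * sem

print(f"t({n - 1}) = {t_stat:.2f}, p = {p:.4f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
print("CI excludes 0:", not (ci[0] <= 0 <= ci[1]), "| p < .05:", p < 0.05)
```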
why have so many researchers been radicalised by Cumming, by confidence intervals, by linear mixed models & by the statistical software called R? it's a real puzzle. email talk at the errorbar.com if you have a theory.
thou shalt become a Bayesian!
i once examined a good PhD thesis that used Bayesian statistics instead of the more typical flavour of stats, which you might call 'Frequentist'. i don’t fully understand the theoretical differences between the two approaches, but in general, the benefit of Bayesian statistics seems to be that you can provide the statistical analysis with a realistic expectation about what the effects you are studying might look like. rather than testing the 'null' hypothesis over & over again, you can instead specify a more plausible range of hypotheses, test whether your data is more compatible with one or another hypothesis, then update your models to better account for all the available data.
this is all good: Bayesian statistics have very clear uses & are complementary to other approaches. but this statement alone – that they are complementary approaches – would provoke great, likely confected, outrage in some corners of twitter & even in some corners of the real world. what struck me about the PhD thesis, & about the researcher's use of Bayesian statistics, was that it came along with a kind of fundamentalist view about how statistics should be done. p-values from null hypothesis significance tests were bad; p-values from Bayesian tests were good. confidence intervals from Frequentist statistics were bad; credible intervals from Bayesian statistics were good. statistical significance with p<.05 is bad; statistical evidence with Bayes Factor greater than 3 is good.
during the viva i asked about this approach, & why we were seeing Bayes factors rather than other statistics in this thesis. it turns out that the PhD students at this department had all attended a Bayesian statistics workshop. from that point onwards, it seemed that most of the researchers had switched to using Bayes. what surprised me most here was the researcher's certainty that Bayes is better. Frequentist bad, Bayes good. populism.
the candidate was rightly awarded their PhD. but after the viva i remained curious about the benefits of Bayes over my usual statistical approach. so i took all the Bayes Factors reported in the thesis & plotted them against the t-test statistic, which [was also reported, or which] i calculated from the same data. the relationship was monotonic, curvilinear & continuous: there was an almost-perfect relationship between Bayes Factor & the t-test. the few deviations from this relationship that existed in my graph were likely due to rounding errors or typos, along with several important cases where the default options of the Bayesian analysis had been changed.
[editorial correction: i have found the graphs that i plotted after examining this thesis. in the first graph i compared the corrected eta parameter with the t-test; in the second, the Bayes Factor with the F statistic (or t^2). they were near-perfectly correlated on a linear-log scale:]
looking back, i now see that the student had been radicalised by this Bayesian workshop into thinking that – somehow – Bayesian statistics are fundamentally different from – & better than – other approaches like the t-test. yet my graph was showing me a near-perfect curvilinear relationship between the two statistics. just as confidence intervals are one-to-one transformations of the p-value, & just as p-rep is a one-to-one transformation of p, so the Bayes Factor is a one-to-one transformation of the t-statistic, so long as the assumptions & prior remain constant.
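[editorial note: here is a sketch in python of that one-to-one relationship. it uses the simple BIC approximation to the Bayes Factor for a one-sample t-test – not the software or the priors used in the thesis – but it makes the point: for a fixed sample size & fixed prior assumptions, the Bayes Factor is a smooth, monotonic function of t:]

```python
# the BIC approximation to BF10 for a one-sample t-test: for fixed n & fixed
# assumptions, the Bayes Factor is a smooth, monotonic transformation of t.
# (this is an approximation for illustration, not the thesis's actual analysis.)
import numpy as np

def bf10_bic_approx(t, n):
    """approximate BF10 from a one-sample t statistic & sample size n."""
    delta_bic = n * np.log(1 + t ** 2 / (n - 1)) - np.log(n)
    return np.exp(delta_bic / 2)

n = 30
for t in (0.5, 1.0, 2.0, 3.0, 4.0):
    print(f"t = {t}: approximate BF10 = {bf10_bic_approx(t, n):.2f}")
# the mapping only bends or breaks when the assumptions change - a different n,
# different priors, or non-default analysis options, as noted above.
```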
thou shalt use R!
there is a move across the sciences – & in psychology in particular – to use different software packages to run statistical analyses. despite its popularity, SPSS has always been a bit clunky, user-unfriendly & by default produces sub-optimal outputs & graphs. moving away from its lumbering legacy is welcome, but what should we move towards? many young scientists are being pushed from running ANOVAs in SPSS towards multi-level mixed models in R. in theory, this could be a good move that may encourage deeper statistical understanding & better practice. in theory.
my heart always sinks a little when i see that 'R' was used in a paper that i'm reviewing – it usually means that the statistics will be opaquely reported, often with a single Chi-square test statistic & a single degree of freedom in place of the full model typically reported with ANOVA; the graphs produced are often unhelpful – a number of lines sloping up or down on the page; & it often comes with no guarantee that the authors actually know what they are doing. to give credit to the targets of my anger in Part I, this is one thing on which i can agree with Gelman & Brown – there are [or at least, seem to be] no good textbooks & little available guidance on how to do mixed model analyses in R.
dogmatically promoting any particular statistical software over any other – as i have heard many times – is a populist appeal that the software is more important than the hypothesis, or the data, or even the user's understanding of statistics. if you can test your hypothesis using a t-test in Microsoft Excel or using a sign test with paper & pencil, then you should be rewarded for designing a simple research question rather than over-complicating the analysis with the latest package of statistical wizardry.
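[editorial note: & here is that paper-&-pencil sign test, as a python sketch with made-up numbers – count how many paired differences go in the predicted direction & ask how surprising that count would be if direction were a coin-flip:]

```python
# the sign test mentioned above, with invented before/after scores: count the
# differences in the predicted direction & compare that count to a fair coin.
from scipy.stats import binomtest

before = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]   # hypothetical scores
after  = [14, 16, 13, 15, 12, 18, 14, 17, 15, 15]

positive = sum(a > b for a, b in zip(after, before))
informative = sum(a != b for a, b in zip(after, before))   # ignore ties

result = binomtest(positive, informative, p=0.5, alternative='two-sided')
print(f"{positive}/{informative} differences positive, p = {result.pvalue:.3f}")
```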
discussion
in Monty Python's Life of Brian, there is a scene which Brian begins as a protestor trying to avoid the Romans, & ends as the Messiah. to avoid detection, Brian poses as a prophet. at first a weak & confusing orator, he gains rhetorical power during his false prophecy & is just about to reveal what salvation will be given to those who convert their neighbour's ox. but the prophecy is cut short as the Romans pass on by.
frustrated by not knowing what they will be given in the afterlife, a growing crowd follows Brian, hanging on his every word, retrieving his gourd & his shoe, inventing stories & mythologies to convey the significance of these artefacts. the crowd divides into two factions: "follow the gourd!" no, "follow the shoe!" as the fanatical crowd chases the retreating Brian into the desert, a wise, elderly man is left among them. "Stop! Stop! Stop, I say! Stop! Let us -- let us pray."
the parable of Brian of Nazareth directly applies to the post-truth, populist polarisations introduced by some scientists' responses to the replication crisis. some scientists want to follow the gourd, others the shoe. yet others want us just to stop, to take a breath & to pray.
if we do decide to follow the populists into the desert & to use their heuristics to improve science, then who should we follow? should we follow the replicators or the preregisterers? the redefiners, the justifiers, or the abandoners? should we follow the pointillists or the intervalists? the hypothesis testers or the estimators? the Frequentists or the Bayesians? from episode 31, should we follow those who seek to simplify or to complificate science? should we follow those who seek to explore their data or those who seek to confirm their hypotheses?
my conclusion is that you don't have to follow any of these dogmas & you don’t have to choose a side. they all have their merits & their applications. the only thing that really matters is that when doing & evaluating science you avoid descending into the kinds of post-truth polarisation that the scientific populists have used to lead you into temptation.
it's not too hard: read the fucking paper, give credit to the authors & believe their words as given, don't hallucinate variables [or motivations], don't accuse authors of scientific misconduct based only on the results of the study, check your biases, don't claim expertise where you have none. just, you know, be nice.
the scientific populists – often in podcasts – make all of these mistakes. & when you scratch the surface of their critiques, you may realise: it's all surface. there's nothing at all beneath their criticism: they say that this study is bad because the results show the effect in women but not in men, & that implies p-hacking. they say that study is bad because it used EEG; EEG produces very noisy data, & that implies that the results can be ignored. they say that this researcher is bad because the distribution of p-values in their published papers is non-uniform & has a little bump.
scientific populism depends upon this superficial appearance of critique – pseudoscholarship. the populists rely on their audience not knowing better, not being experts, not reading the paper, & not fact-checking the populist. don't look it up.
in this journey through the podcasting populists of scientific reform, i have come to realise that, far from being the advance-guard of the movement, leading us in our struggle towards better methods, better practices & better communication of our work, these scientific populists present us only with a meagre menu of options, a thin hubristic suite of heuristics that, on their own, solve no problems at all.
just like their political cousins, scientific populists start from the claim that science is broken. they see scientific poverty, degeneration & corruption wherever they look – & they look a lot. they criticise the traditions & institutions of modern science & they blame them for all the problems they have identified. they offer their followers a simplistic solution to fix these problems: preregister, use confidence intervals, replicate, redefine significance, become a Bayesian. they polarise science by drawing a line in the sand – you're a pre-registerer or you’re not; exploratory or confirmatory; .005 or .05; Bayesian or Frequentist; R or Excel. either you're with us or you're against us. & like their political cousins, they embrace post-truth science by not reading the paper, not limiting their critique & not fact-checking their own claims. they embody the three modern political problems identified by The Rest is Politics podcast: polarisation, post-truth & populism.
conclusion
to conclude the second part of this three-part episode on problems in the scientific reform movement: i have argued that these questionable open science practitioners [QUOSPs] are populists. first, they identify a problem in science. second, they see evidence of this problem everywhere. third, they propose a simplistic solution to this problem. fourth, they use social media, podcasts & newspapers to popularise their analysis of the problem & their polarising, populist solutions. they win many followers both within & outside of science. their simple solutions to our complex problems can seem very attractive. to paraphrase Karl Marx, 'scientific populism is the sigh of the oppressed scientist, the opium of the people.'
my new view of the scientific populists is that while they are very good at self-promotion, they're not very good at science.
epilogue
regardless of when the replication crisis is said to have begun, science has always lurched from crisis to crisis, some larger & more consequential than others. regardless of some scientists misunderstanding p-values, many have a deep understanding of them & much to say in their defence. regardless of some scientists being seen committing fraud, p-hacking, or selectively-reporting their data, there have always been good scientists doing careful, unbiased work. regardless of some scientists telling stories about p-curves or sample sizes, there are many scientists doing sound, cumulative problem-solving & thinking deeply about what they do. in the final part of my epic account of problems in the open science movement, i will turn & give credit to the approaches & solutions that – i believe – might actually help us make progress.
[🎶 OUTRO: "Cosmopolitan - Margarita - Bellini" by Dee Yan-Kee 🎶]
it's closing time at the error bar, but do drop in next time for more brain news, fact-checking & neuro-opinions. take care.
the error bar was devised & produced by Dr Nick Holmes from University of Birmingham's School of Sport, Exercise and Rehabilitation Sciences. the music by Dee Yan-Kee is available from the free music archive. find us at the error bar dot com, on twitter at bar error, or email talk at the error bar dot com.