OPEN SCIENCE & ITS ENEMIES, PART I   ⏱26:50

in this episode, i cover all the brain news i've read this month, & present the first of a 3-part essay criticising some sections of the open science movement

episode #39, 20th May 2024 #openscience #reform #critique #commentary

  ⇦ listen here ⏱26:50


  ⇨ transcript here


our main brain stories this episode...

Part I: The p-circlers

err, there's, there's no brain news. this is a very busy month for academics - dissertations, exams, preparing for conferences. i've not had time to read the brain news, but i'm pretty sure that the media coverage of it would have been random at best, & poor at worst. & that's the news.

instead, this month, & for the next two episodes, i am going to put down in writing - & lay down in audio - some of my thoughts on some problems within the open science movement. i've been meaning to do this for a while. & i've named the three episodes by playing on the title of a book called 'The Open Society and its Enemies' by Karl Popper, a philosopher best known for his work on the logic of science & of swans. Popper's book, which i haven't read, was about totalitarian states. my podcast mini-series is about some of the not-too-similar behaviour of scientists. if you're a philosopher who would like to compare & contrast my podcast with Popper's book, then please email talk@theerrorbar.com.

Open science and its enemies. Part I: the p-circlers

this week's episode has been prompted by events, & i am producing a special series of three episodes on problems in the open science movement. & like the rest of the podcast, they are my opinions only, but they have been on my mind for some time - years, in some cases.

before the criticism of some of the open science movement, some background. in 2011 there was what historians of science might end up calling some kind of an 'awakening' within a small but significant section of the scientific community. it has now come to be known as 'the replication crisis', & wikipedia currently dates this to the start of the 2010s. & my understanding of the replication crisis is that it started in certain areas of psychology: in social & personality psychology in particular, but also in other areas of experimental psychology. & it started with some much-focussed-on papers, like a paper by Daryl Bem called 'feeling the future'. in that paper, published in March 2011, Bem claimed to find multiple sources of experimental support for human pre-cognition - that is, for typical healthy participants being able to predict the future.

to be clear: humans can't predict the future; instead, it is now believed by a number of members of the open science community - & wider - that the data in these experiments seems to have been manipulated &/or selectively reported in order to 'tell a good story'.

the awakening that this paper provoked has forced some scientists to ask how it was even possible, in the early 21st century, that we were still publishing & reading papers - apparently without irony - about paranormal concepts & seemingly-impossible results. extra-ordinary claims should require extra-ordinary evidence, they rightly claim. the forensic assault on Bem's paper which followed, to find its faults, & the sharing of this investigation amongst a whole cohort of scientists, seems to have provoked a large number of them to unite, to declare a crisis, & to foment a revolution.

so let's assume for my arguments in these three episodes, that the replication crisis really did begin in 2011. & that there was a large group of people simultaneously outraged by Bem's paper - & other papers like it - & who were determined to make science great again. they wrote books, they formed societies, they discussed science on twitter & on podcasts. & they've had - to be fair - quite an amazing effect. so let there be no mistake: i have great respect for this revolution, i agree in large part with its methods, results & conclusions, & i believe that science - in general - is in a better state because of it.

but there are problems in this movement. & this series of three episodes will describe some of those problems from my perspective, & i'll give my view on them. when listening to my bilious outrage on these problems, please remember that there is much to praise in the open science movement, & i'll try to direct you back to the good when i can.

in Part I, i focus on what i am calling the p-circlers. p-circling is a pejorative term used to describe scientists who search the published literature looking for reports of unlikely effects - often in personality & social psychology. when they find these unlikely-looking effects (for example that healthy young adults can predict the future), the p-circlers hover over the paper, searching for p-values that look 'too big' (or equally those that look 'too small'). based on these unusual- or unlikely-looking p-values (or other statistics reported in the paper), the p-circlers decide that the scientific work must be low quality, & then they search for reasons why these p-values cannot be right.

another pejorative term for this & similar practices is 'reverse p-hacking'. in the regular kind of forward p-hacking, a researcher will find a p-value that is quite small, but not small enough for them to make a conclusion in line with their hypothesis (or at least in line with anything that is publishable). to make this p-value smaller, the scientist then makes a series of adjustments to the data &/or the analysis in order to change the p-value from being 'not publishable' to 'publishable'. this is p-hacking because the researcher hacks away at their data & the analysis until the p-value is looking more like the beautifully-sculpted result in the imagination of the scientist.

reverse p-hacking, then, is when a researcher finds a small p-value that they do not like, & they search for any method of analysis that will make that p-value large enough to be safely ignored. & within the scientific reform movement, while there has been a lot of focus on p-hacking, there has been much less focus on reverse p-hacking, or p-circling.
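to show how both cartoons play out statistically, here's a toy simulation in python - entirely my own construction, with made-up numbers, nothing to do with any real paper. it uses the simplest possible model of the behaviour: an analyst runs five equivalent tests & then reports either the smallest p-value (the p-hacker) or the largest (the reverse p-hacker):

```python
# a toy simulation (my own illustration, not anyone's real data) of the two
# cartoons above: a 'p-hacker' runs several equivalent analyses & reports the
# smallest p-value; a 'reverse p-hacker' does the same & reports the largest.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, n_outcomes, n_sims = 30, 5, 5000

def simulate(true_effect):
    smallest, largest = [], []
    for _ in range(n_sims):
        # two groups, 5 outcome measures each; the effect (if any) is the
        # same size on every outcome
        a = rng.normal(true_effect, 1, (n_per_group, n_outcomes))
        b = rng.normal(0, 1, (n_per_group, n_outcomes))
        p = stats.ttest_ind(a, b).pvalue        # one p-value per outcome
        smallest.append(p.min())
        largest.append(p.max())
    smallest, largest = np.array(smallest), np.array(largest)
    return (smallest < 0.05).mean(), (largest < 0.05).mean()

hacked_null, reverse_null = simulate(true_effect=0.0)
hacked_real, reverse_real = simulate(true_effect=0.5)

print(f"no true effect : smallest p 'significant' {hacked_null:.0%} of the time (should be ~5%)")
print(f"no true effect : largest  p 'significant' {reverse_null:.0%} of the time")
print(f"real effect    : smallest p 'significant' {hacked_real:.0%} of the time")
print(f"real effect    : largest  p 'significant' {reverse_real:.0%} of the time")
```

the point of the sketch is only that the same selective behaviour cuts both ways: picking the smallest p manufactures 'effects' out of nothing, & picking the largest p makes a perfectly real effect disappear.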

both p-circling & p-hacking are characterisations or cartoons of behaviour - it's very likely that all of us scientists have engaged in one or both of these practices at different times & to different extents. & it's also unlikely that many scientists engage in the very worst kinds of p-circling or p-hacking. but within certain sections of the open science movement, p-hacking seems to have become a terrible crime. & the punishment for this crime of p-hacking is for the inquisitors to whip up criticism on social media, to write a commentary or a preprint decrying the questionable research practices of the researchers, to post the accusations on pubpeer.com, or - ahem - to put them into a podcast episode.

if there *has* been genuine p-hacking, then that is of course a bad thing, perhaps worthy of a commentary, perhaps worth inviting the researchers under investigation to comment or respond to the accusations. but with social media as it can be, & the social skills of many academics as they seem to be, this robust, respectful & open exchange of views often doesn't happen. instead, a small group of researchers may sling some mud at the accused on social media & the pile-on begins.

i've seen quite a few pile-ons in my time on twitter & they are not a pretty sight. i most remember the one about the rubber hand illusion, which i covered in episode 10 of this podcast. but there have been others. i am usually just-outside the echo chamber in which these pile-ons occur, likely because i have already blocked - or been blocked by - the pile-on proponents; instead i may just see the weary tweets of other researchers, returning from the front lines of the battle, having escaped the maelstrom of twitter war.

now that i've escaped twitter almost entirely - i have decreased my dose to a single brief exposure once-a-week - i have much less awareness of this kind of open science dialogue. my academic life is much improved because of it. i have more time to focus on doing & reporting my science rather than getting drawn into the troubles of whichever poor post-doc is piling-on or being piled-on. i strongly encourage researchers to step away from all of that. i moved to bluesky for a quieter life, away from elon musk's climate denier bots & porn fairies.

& it was when i was on my bluesky retreat this week that the p-circlers of open science found me. thankfully, no pile-on has occurred - yet - & bluesky is still blissfully-low-engagement. but all the elements were in place for a renewed p-circler pile-on. so let me describe the background to what happened this week.

one week before Christmas, in December 2023, Aungle & Langer, two psychological scientists at Harvard University, published a paper called "Physical healing as a function of perceived time". in that paper, 33 women were given three versions of an intervention, in which they placed suction cups on their body - now apparently this is a thing that people do, & coincidentally this week, one of the other swimmers in the university of birmingham's swimming pool also seemed to have been cupping. after cupping, in this study, the participants completed a healing questionnaire 7 times during what they were told was either a 14-, 28- or 56-minute period.

the trick in the experiment was that the experimenters placed a clock on a tablet next to the computer desktop, & the clock showed the number of minutes that had elapsed. but that clock was running at different speeds in the three repeated interventions - one was running at half normal time, one at normal, & one at twice normal time. & after every 4 real minutes that had elapsed, participants completed the healing questionnaire. to reinforce the time manipulation in this study, participants' personal belongings had been taken off them at the start of the experiment, to 'minimise distraction', they were told. the participants' cupped body part was photographed carefully, both just after cupping & at the end of the 28-minute period. the photographs were rated by 25 anonymous, independent raters recruited on the mechanical turk platform. they were asked to rate a series of randomly-ordered photographs of the cupping after-effects, from 'not at all healed' to 'completely healed' on an 11-point scale from 0 to 10.
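to make the design concrete, here's a minimal sketch of the time manipulation as i understand it from the methods description above - the half/normal/double clock-speed factors & the variable names are my own reconstruction, not the authors' code or materials:

```python
# a minimal sketch (my own reconstruction, not the authors' materials) of the
# clock manipulation: 28 real minutes pass in every condition, but the on-screen
# clock runs at half, normal, or double speed, & the healing questionnaire
# appears every 4 real minutes (7 times in total).

REAL_DURATION_MIN = 28         # every condition lasted 28 real minutes
QUESTIONNAIRE_EVERY_MIN = 4    # questionnaire every 4 real minutes

clock_speed = {                # displayed minutes per real minute (my labels)
    "slow   (perceived 14 min)": 0.5,
    "normal (perceived 28 min)": 1.0,
    "fast   (perceived 56 min)": 2.0,
}

for condition, factor in clock_speed.items():
    displayed = [int(t * factor) for t in range(QUESTIONNAIRE_EVERY_MIN,
                                                REAL_DURATION_MIN + 1,
                                                QUESTIONNAIRE_EVERY_MIN)]
    print(f"{condition}: questionnaire at displayed minute {displayed}")
```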

now, reading this paper, as i did today, the description of this study is careful, lots of controls have been implemented, & in general this feels to me like a smart, well-controlled manipulation. the authors have made their data & analysis code freely available as supplementary materials, alongside the article. the study was not pre-registered, as some might demand, but overall, this sounds like good practice in the age of open science.

the main results of the study were that the independent judges rated wound healing as greater in photographs taken at the end of the fast time condition than in the other two conditions. when participants perceived that 56 minutes had passed by - rather than 28 - their skin was rated as more healed by these independent raters. alongside these main results based on the study's main dependent variables, the authors also collected a number of personality measures. some of these were collected on a repeated basis along with the healing questions, but most others were done as part of an 'enrolment survey'. about this part, the authors wrote, & i quote:

"Participants completed several personality and psychosocial measures of wellbeing at their time of enrolment to maintain our cover story while also allowing subsequent analyses to account for factors such as stress, anxiety, & depression known to be related to wound healing [ref36]... " (page 3)

& the authors did indeed explore the data by including these variables in the model, but they made no difference, & there was no theoretical reason to include them anyway. they wrote:

"Including personality trait variables did not significantly improve the model, & we did not have any theoretical reason to retain them in our analysis. We measured personality traits only to maintain our cover story" (page 5)

great. it's very often the case that reviewers might ask for particular analyses to be performed. & in my own papers, i make a point of saying 'following a reviewer's request, we did this, or that'. now i don't know anything about the reviewers in this case, & it's very possible that some of those analyses were requested by reviewers, & some may not have been. we don't know.

overall then, the result is very simple: participants' perceived time affects how independent raters perceive their healing. a carefully-manipulated cover story; a repeated measures experiment; fully-open data & code; there's no preregistration & there's some limited exploration of alternative analytic approaches. overall, the result may well sound surprising, but there's nothing in this report that could possibly justify accusations of p-hacking or questionable research practices, right?

wrong. enter the p-circlers.

the reason that this story appears - at length - on the error bar this month, is that i was browsing the twitter-refuge bluesky at 7am on wednesday morning. i saw a post by Dr Nick Brown, a well-known open science reformist & self-described 'data thug'. he posted news of his latest work, as second author to Professor Andrew Gelman, in a preprint offering a critique of the cupping paper, & a tutorial on running multi-level general linear mixed models, in the statistical software, R. from what i'd already read of Professor Gelman's prior work, i decided to read the commentary, hoping to learn something about statistical modelling - as far as i can tell, he is seen as an expert. on the other side, i also knew of Dr Brown's previous work, & especially his open science credentials, & this made me skeptical on balance, but open at least to learning something.

what i read on that wednesday morning, & re-read that wednesday afternoon, was one of the poorest, laziest pieces of academic commentary that i've come across. it was astounding in its language & in its incorrect & disparaging claims about the cupping article. i said not quite this much on bluesky on wednesday morning, but later i posted my own annotations on their preprint, which Dr Brown acknowledged on friday. i remain genuinely puzzled about how so-called open science advocates could produce commentary like this. & i shall read some direct quotations from their preprint. the emphasis will be mine, & the selection of course is mine, & i will interpose my responses between the quotes.

Gelman & Brown begin their commentary by outlining the phenomenon of the 'replication crisis' in social psychology, just as i have done in the introduction to this podcast, & on this we agree. the purpose of their commentary, they say, was to, & i quote:

"explore some general issues by examining in detail a recent psychology paper & investigate problems that might not be apparent in a casual reading but nonetheless lead to unreplicability." (that was on page 2 of their preprint)

the paper in question is the cupping paper by Aungle & Langer (2023), which i've already described. Gelman & Brown present no evidence that this paper is 'unreplicable'. it was only published 6 months ago, so it's unlikely that replications have even been attempted. perhaps Gelman & Brown meant: that this paper 'should be replicated'? the commentators describe the results of the study as, & i quote:

"a flawed research project [...]" with "speculative or implausible" background, "a noisy experiment" that "could have been arranged in a way that convinced authors and reviewers alike that they were seeing strong evidence." (that's from page 3 of their preprint).

the commentary authors do not say "in detail" what was flawed, what was speculative, what was implausible, & what was 'noisy' about the experiment. as a Professor of Statistics, the first author should know - very well - that all variables are - well, variable. all research data includes noise or measurement error. so my interpretation of their choice of words is that the commentators have decided that they do not like the result of the cupping study, & therefore the data must be described as 'noisy' so that they can safely explain-away the effects. the commentators present no evidence that the data are 'noisy', nor any definition of what is meant by 'noisy'. i assume they mean 'variable', in which case it means nothing at all. all data are variable, & that's why we do statistics. so we must ignore all of the claims in the commentary that the data are noisy, & just focus on the statistics.

in the next part the commentators say that textbook guidance on running linear mixed models is unclear, & researchers often get the analysis wrong. indeed. & it was presumably, in part, because of some very high t-test statistics presented in the cupping paper (t-test values as large as 7 or 10, for the statistics nerds in the audience) that the commentators decided to re-analyse the data from the cupping paper. they were correct to be suspicious - Aungle & Langer (2023) had indeed analysed the data incorrectly - failing to account correctly for random effects & random slopes in the model - & the original t-test value of 10.7 was reduced in the re-analysis to 3.0. this sort of error is common even in simple statistics like t-tests, correlations, & ANOVAs. i see it quite often, for example, when data from a repeated measures design is pooled into a single correlation or other analysis. so a correlation is done, for example, with twice as many datapoints as there were participants in the experiment. this over-estimates the degrees of freedom & pools within- & between-subject variance. it is commonly done, but it is wrong. & the problems described by Gelman & Brown are similar.
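to make the pooling problem concrete, here's a toy simulation - my own made-up numbers, nothing to do with the cupping dataset - showing how treating two repeated measurements per participant as independent datapoints doubles the apparent sample size & inflates the degrees of freedom:

```python
# a toy example (not the cupping data) of the pooling problem described above:
# two measurements per participant are treated as if they were independent
# datapoints, doubling the apparent sample size.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_participants = 33

# each participant has a stable baseline shared by both variables (between-
# subject variance), plus independent noise on each of their 2 measurements
baseline = rng.normal(0, 1, n_participants)
x = np.repeat(baseline, 2) + rng.normal(0, 1, 2 * n_participants)
y = np.repeat(baseline, 2) + rng.normal(0, 1, 2 * n_participants)

# wrong: pool all 66 'datapoints' into one correlation (66 - 2 = 64 df)
r_pooled, p_pooled = stats.pearsonr(x, y)

# safer: average each participant's 2 measurements first (33 - 2 = 31 df)
x_mean = x.reshape(n_participants, 2).mean(axis=1)
y_mean = y.reshape(n_participants, 2).mean(axis=1)
r_mean, p_mean = stats.pearsonr(x_mean, y_mean)

print(f"pooled   : r = {r_pooled:.2f}, p = {p_pooled:.4f}  (pretends 66 independent datapoints)")
print(f"averaged : r = {r_mean:.2f}, p = {p_mean:.4f}  (33 participants)")
```

averaging first is only the simplest fix; the more general solution is the kind of mixed model that Gelman & Brown fit, with participant as a random effect.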

so from the re-analysis, the cupping effect had been very much over-estimated, but only in terms of its statistical strength. the summary data were presented clearly in a 'violin plot' in Figure 1 of the paper. the individual means were clearly shown, alongside the group means & standard errors for each condition, & all the data were freely available. from that figure, the statistical effect of perceived time looks quite strong. & indeed, the re-analysed t-test statistic from Gelman & Brown is 3.0. for stats nerds, that corresponds to a p-value of 0.005, & - by long-standing statistical convention - this is a significant statistical result, in the hypothesised direction, showing that the manipulation was effective & which therefore deserves further attention. the original analysis may have been wrong, but the result stands. do Gelman & Brown acknowledge that, despite the incorrect analysis, the experiment was successful & clearly rejected the null hypothesis? no, they don't. Gelman & Brown later write, & i quote:
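for the stats nerds who want to check that conversion from t to p, here's a short sketch; the exact degrees of freedom of Gelman & Brown's re-analysed model aren't stated here, so the df values below are my assumptions for illustration only:

```python
# a quick check of the p-value for a t-statistic of 3.0 (two-tailed). i don't
# have the exact degrees of freedom of the re-analysed model to hand, so these
# df values are assumptions for illustration; with ~30 df, p is about 0.005.

from scipy import stats

t = 3.0
for df in (30, 32, 60, 1000):
    p = 2 * stats.t.sf(t, df)        # two-tailed p-value from the t distribution
    print(f"df = {df:4d}: p = {p:.4f}")
```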

"any effect of the manipulation on healing is estimated to be highly variable, sometimes positive & sometimes negative." (& this is on page 6). the commentators provide no evidence for why this result is apparently 'highly variable'. seemingly in order to justify their rejection of this significant experimental result, Gelman & Brown seem to assume that the only reason that this carefully-designed experiment worked at all in the hypothesised direction, must instead be due to some questionable research practices. thus, they write:

"the statistically-significant result that appeared is one of many comparisons that could have been made. data were also gathered on participants' anxiety, stress, depression, mindfulness, mood, & personality traits, implying many possible analyses that could have been performed" (page 6).

of course, the authors of the cupping paper *could* have analysed the data that way, but they clearly stated their reasons for including these variables, & they explicitly referred to an analysis that did include these covariates, which they reported made no difference to their conclusions. & this is before we recall - from the methods section - that these between-subject variables were collected at the start of the experiment as part of a cover-story about personality. the healing questionnaire was administered seven times during the experiment, but most importantly, the before-and-after photographs provided the primary outcome variable. these photographs provided the ratings of healing. clearly, this repeated measures study is about healing, not personality. Gelman & Brown are being extremely unfair to the authors in claiming that they could have analysed the data another way. this is reverse p-hacking, pure & simple.

additional comments from Gelman & Brown do not help their case. at one point they claim that, & i quote: "three standard errors away from zero does not necessarily represent strong evidence of an effect" (on page 6). three standard errors is a t-score of three. yet earlier, they said that "researchers are lucky to find" effects greater than two standard errors. so, is it a small effect or a big effect? we don't know. later, again, they say that "differences [of] two or three standard errors from zero, [...] would seem to be unlikely to occur by chance alone." again, saying that 2 or 3 is a large effect. this lack of consistency in interpreting something as simple as the t-test statistic is surprising from these commentators, to say the least. there are other points at which the commentators simply state that data are 'noisy' or 'highly variable' for no apparent reason other than that they do not seem to like the conclusions that are being reported. & they save their worst prejudice for last, by concluding (& the emphasis is mine here), i quote:
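for clarity, 'standard errors away from zero' & the t-statistic are the same quantity by definition (here β̂ is just generic notation for the estimated effect):

```latex
% the t-statistic is the estimated effect divided by its standard error,
% so 'three standard errors away from zero' means t = 3
t = \frac{\hat{\beta}}{\mathrm{SE}(\hat{\beta})}
```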

"all of this occurs in the context of what is undoubtedly a sincere and highly motivated research program. the work being done in this literature can feel like science: a continual refinement of hypotheses in light of data, theory, & previous knowledge. it is through a combination of statistics (recognizing the biases and uncertainty in estimates in the context of variation and selection effects) & reality checks (including direct replications) that we have learned that this work, which looks and feels so much like science, can be missing some crucial components. this is why we believe there is general value in the effort taken in the present article, to look carefully at the details of what went wrong in this one study & in the literature on which it is based".

this conclusion requires no comment or critique. it is a self-parody.

to conclude Part I: some of the worst enemies of the open science movement have come from within the open science movement itself. they reveal themselves to be lazy readers of the target literature; they arrogantly dismiss the claims of the scientists they critique, & they ignore evidence that contradicts their presuppositions; they are hypocritical in their accusations of 'selectivity' & 'bias'; they are inconsistent in deciding what is a 'small' & what is a 'large' p-value or t-statistic or effect-size, or what data is 'noisy' & what is not; they dismiss entire fields of research with the wave of a hand & by printing a few equations. they claim to be standing up for real science & real scientists against pseudoscience & the pseudoscientists. but in selectively commenting on this work, they pump out biased, self-serving, self-contradictory, selective, condescending prose that can only be described as pseudo-scholarship.

in Part II of Open science and its enemies, i focus on the role of podcasts & podcasters in the open science movement. it won't be quite as dramatic as this one, but please listen anyway.