Data Detective Exposes Fraud and Mistakes in Leading Clinical Trials

A small number of diligent people are combing through the scientific record for suspicious data in clinical research. John Carlisle, an anaesthetist for England’s National Health Service, is one of them. He has spotted problems in a large number of research papers, and his findings have led to hundreds of retractions, due to both misconduct and honest mistakes. Even a leading medical journal has changed its practices after several of Carlisle’s discoveries.

Over the past decade, John Carlisle has inspected numerous spreadsheets listing the ages, weights and heights of people in various clinical trials, and unfortunately, some of those people never actually existed. He has investigated trials covering a wide range of health issues, from the benefits of specific diets to guidelines for hospital treatment. Carlisle’s thorough data analysis helped bring down three of the six scientists with the most retractions worldwide.

“His technique has been shown to be incredibly useful,” Paul Myles, director of anaesthesia and perioperative medicine at the Alfred hospital in Melbourne, Australia, who has worked with Carlisle to examine research papers, told Nature. “He’s used it to demonstrate some major examples of fraud.”

Although critics argue that his approach has sometimes cast doubt on papers that are not flawed, Carlisle believes that he is helping to protect patients. Many researchers who check academic papers share his opinion that journals and institutions should be doing much more to spot mistakes.

“I do it because my curiosity motivates me to do so,” he said, not because of an overwhelming need to uncover wrongdoing: “It’s important not to become a crusader against misconduct.”

More than ten years ago, Carlisle became suspicious of data that looked too “clean” and analyzed results published by Yoshitaka Fujii, who worked at Toho University in Tokyo. In a series of randomized controlled trials (RCTs), Fujii claimed to have examined the impact of various medicines on preventing vomiting and nausea in patients after surgery. Using statistical tests, Carlisle showed in 2012 that, in many cases, the likelihood of the reported patterns having arisen by chance was vanishingly small. Fujii lost his job and now holds the record, with 183 retracted papers. Other researchers soon cited Carlisle’s work and used variants of his approach in their own data analyses.

Carlisle believed that the field of anaesthesia was hardly the only one at fault, so he took eight leading journals and randomly checked trials they had published. In 2017 he stirred up controversy with an analysis published in the journal Anaesthesia: he had found suspicious data in 90 of more than 5,000 trials. At least ten of these papers have since been retracted and six corrected. A high-profile study published in The New England Journal of Medicine (NEJM) was also corrected.

This year, he raised concerns about anaesthesia studies by Mario Schietroma of the University of L’Aquila in Italy, after spotting suspicious similarities between the raw data for control and patient groups in several of Schietroma’s papers. The World Health Organization (WHO) had cited Schietroma’s work when, in 2016, it issued a recommendation that anaesthetists should routinely boost the oxygen levels they deliver to patients during and after surgery. The WHO has since revised its recommendation to “conditional”, and investigations are reportedly underway.

Carlisle relies on the fact that real-life data have natural patterns that fabricated data struggle to replicate. In RCTs, he analyzes the baseline measurements (such as height and weight) that describe the characteristics of the control group and the intervention group. Because volunteers are randomly allocated to these groups, the mean and standard deviation of each characteristic should be about the same across groups, but not suspiciously identical.

Nature described Carlisle’s method as follows: he first constructs a P value for each pairing, a statistical measure of how likely the reported baseline data points are if one assumes that volunteers were, in fact, randomly allocated to each group. He then pools all these P values to get a sense of how random the measurements are overall. A combined P value that is too high suggests that the data are unusually well balanced; one that is too low could indicate that patients were randomized incorrectly.
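The pooling step can be sketched in a few lines of Python. This is a minimal illustration of the general idea, not Carlisle’s actual implementation: it approximates each per-variable comparison with a z-test on the reported summary statistics, and pools the results with Fisher’s method (Carlisle’s published analyses use related pooling rules, such as Stouffer’s method).

```python
import math

def two_sided_p(m1, sd1, n1, m2, sd2, n2):
    """Approximate two-sided P value for one baseline variable,
    using a normal (z) approximation to the two-group comparison
    on reported means, standard deviations and group sizes."""
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    z = (m1 - m2) / se
    # Two-sided tail probability of the standard normal
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2))))

def fisher_combine(p_values):
    """Pool P values with Fisher's method: X = -2 * sum(ln p) follows
    a chi-square distribution with 2k degrees of freedom under the
    null. Because the degrees of freedom are even, the survival
    function has a closed form, so no stats library is needed."""
    x = -2.0 * sum(math.log(p) for p in p_values)
    k = len(p_values)
    half = x / 2.0
    return math.exp(-half) * sum(half**i / math.factorial(i) for i in range(k))
```

A pooled value very close to 0 flags implausible imbalance between the groups, while one very close to 1 flags baseline tables that are suspiciously well balanced, which is the pattern Carlisle looks for in fabricated data.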

This method is not perfect. Some papers are flagged as problematic but are in fact sound, so further analysis is necessary. The statistical checks assume that the variables in the table are independent, but reality is often messier: height and weight, for example, are correlated. Nevertheless, applying Carlisle’s method is a good first step.

“It can put up a red flag. Or an amber flag, or five or ten red flags to say this is highly unlikely to be real data,” said Myles.

Carlisle does not attribute any cause to the possible problems he identifies. However, in 2017, when his analysis of 5,000 trials appeared in Anaesthesia, it provoked a strongly worded response from many scientists and journal editors.

The need and demand for double-checking published data are on the rise, and several new methods have emerged in recent years. At least two journals (Anaesthesia and NEJM) now use such statistical checks as part of the publication process for all papers.

“We are looking to prevent a rare, but potentially impactful, negative event,” a spokesperson for the NEJM said. “It is worth the extra time and expense.”

Michèle Nuijten, who studies analytical methods at Tilburg University in the Netherlands, has developed statcheck, a “spellcheck for statistics” that scans journal articles to check whether the statistics described are internally consistent. Statcheck has its limitations, though: it runs only on the strict data-presentation format used by the American Psychological Association.
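Statcheck itself is an R package; purely as an illustration of the underlying idea, the hypothetical Python sketch below parses an APA-style report such as “t(58) = 2.00, p = .046” and flags gross mismatches between the test statistic and the reported P value. It uses a normal approximation to the t distribution, so unlike the real tool it can only catch large inconsistencies.

```python
import math
import re

def norm_two_sided_p(z):
    """Two-sided P value for a standard-normal statistic."""
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2))))

def check_apa_t(report, tol=0.02):
    """Parse an APA-style 't(df) = x, p = y' string and check that
    the reported P value roughly matches one recomputed from the
    statistic. Normal approximation only, so `tol` is deliberately
    loose; returns None if no t-test report is found."""
    m = re.search(r"t\((\d+)\)\s*=\s*(-?[\d.]+),\s*p\s*[=<]\s*([\d.]+)", report)
    if not m:
        return None
    t, p_reported = float(m.group(2)), float(m.group(3))
    p_recomputed = norm_two_sided_p(t)
    return abs(p_recomputed - p_reported) <= tol
```

Running it on a consistent report returns True, while an impossible pairing such as “t(58) = 2.00, p = .30” returns False, the kind of internal contradiction statcheck is designed to surface.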

Nick Brown, a graduate student in psychology at the University of Groningen, Netherlands, and James Heathers, who studies scientific methods at Northeastern University in Boston, Massachusetts, have used a program called GRIM to double-check the calculation of statistical means. GRIM works only when the underlying data are integers.
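The GRIM idea is simple enough to sketch directly: a mean of n integer-valued responses must equal some whole-number total divided by n, so many reported means are arithmetically impossible. The function below is an illustrative reimplementation of that core check, not the authors’ own code.

```python
def grim_consistent(mean, n, decimals=2):
    """GRIM test: a mean of n integer responses must equal an integer
    total divided by n. Check whether the reported mean (rounded to
    `decimals` places) is reachable from any integer sum."""
    total = round(mean * n)
    # Check the nearest integer totals to absorb rounding ambiguity
    for candidate in (total - 1, total, total + 1):
        if round(candidate / n, decimals) == round(mean, decimals):
            return True
    return False
```

For example, with 25 participants a reported mean of 3.48 is possible (87 / 25), but a mean of 3.49 cannot be produced by any integer total and would be flagged.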

Text-mining techniques have also allowed researchers to assess papers and find “small adjustments”. One possible use is the investigation of P-hacking, in which data are tweaked to produce significant P values.

Carlisle’s checks are laborious, time-consuming, and not universally accepted. Automation and new methods will be necessary to check even a significant fraction of the roughly two million papers published across the globe each year. Carlisle has seen many cases of fraud, repeated offences, typos and honest mistakes, and he thinks he has an idea of what drives some researchers to make up their data.

“They think that random chance on this occasion got in the way of the truth, of how they know the Universe really works,” he says. “So they change the result to what they think it should have been.”


By Andreja Gregoric, MSc
