Proposal for an extension to PRISMA for systematic reviews that are based on clinical trial registry entries

The advent of clinical trial registries has enabled a new means of evaluating and synthesizing human research. However, there is little specific guidance from PRISMA for researchers who wish to include clinical trial registry entries in their systematic reviews. I would suggest an extension to PRISMA to directly address this gap.

My main suggestions would be to explicitly require researchers to:

  • Justify which clinical trial registries were included
  • Specify retrieval methods (“downloaded from” is not enough)
  • Distinguish between human-curated vs machine-interpreted data
  • Specify details of procedure for human-curated data, or code and quality control efforts for machine-interpreted data
  • Provide the decision procedure for matching registry entries to publications

I have provided explanations, examples and code below where I felt it was appropriate.

Choice of sources

There are currently 17 primary clinical trial registries other than ClinicalTrials.gov listed by the WHO that meet the 2009 WHO registry criteria. Most reviews of clinical trial registry entries only include entries from ClinicalTrials.gov, and few provide any rationale for their choice of trial registry. This is a small enough number of registries that it is reasonable to ask authors to specify which ones were searched, or to justify why any were excluded.

Specification of retrieval methods

There are at least four distinct ways to download data from ClinicalTrials.gov alone:

  1. The entire database can be downloaded as a zipped folder of XML files.
  2. A CSV or TSV file containing a table of search results can be downloaded from the web front-end of ClinicalTrials.gov.
  3. A zipped folder of XML files can be downloaded from the web front-end of ClinicalTrials.gov.
  4. The API can be queried for an XML response.

These methods do not provide the same results for what may appear to be the same query.

For example, a search performed on the web front-end of ClinicalTrials.gov for the condition “renal cell carcinoma” returns 1745 results. (See Code example 1.)

A query to the API for the condition “renal cell carcinoma,” however, returns 1562 results. (See Code example 2.)

These are both searches of ClinicalTrials.gov for the condition “renal cell carcinoma,” but each produces a very different set of records. The difference is that the web front-end also includes search results for synonyms of “renal cell carcinoma,” in order to ensure the highest sensitivity for patients who are searching for clinical trials to participate in.

Similarly, the web front-end will often include results for related drugs when searching for a particular drug name. For example, a search for temsirolimus also returns results for rapamycin.

PRISMA currently tells researchers to “Present full electronic search strategy for at least one database, including any limits used, such that it could be repeated.” More specific guidance seems to be required, as (in my experience) the bulk of systematic reviews of clinical trial registry entries do not distinguish between downloading results via the API vs the web front-end.
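One way to make a registry search repeatable is to record the retrieval method alongside the query itself. The following is a minimal sketch in base R, not a PRISMA requirement: the function `record_retrieval()` and its field names are hypothetical, the registry name is filled in for illustration, and the counts are the two reported above.

```r
# Hypothetical helper: bundle everything needed to repeat a registry search.
# The field names are illustrative, not taken from PRISMA.
record_retrieval <- function(registry, method, query, n_records,
                             retrieved_on = Sys.Date()) {
  # The four retrieval routes described in the text
  allowed <- c("bulk download", "web front-end table",
               "web front-end XML", "API")
  method <- match.arg(method, allowed)  # errors on an unrecognized method
  data.frame(
    registry = registry,
    method = method,
    query = query,
    n_records = n_records,
    retrieved_on = as.character(retrieved_on),
    stringsAsFactors = FALSE
  )
}

# Same condition, two retrieval routes, two different record counts
searches <- rbind(
  record_retrieval("ClinicalTrials.gov", "web front-end XML",
                   "condition: renal cell carcinoma", 1745),
  record_retrieval("ClinicalTrials.gov", "API",
                   "condition: renal cell carcinoma", 1562)
)
print(searches)
```

Reporting a table like this makes it immediately visible when two "identical" searches were in fact retrieved by different routes.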

Human-curated data vs machine-interpreted data

Post-download screening steps

Screening clinical trial registry entries for inclusion or exclusion can often be done at the point of searching the registry. In many cases, however, the search tools provided by a clinical trial registry do not have exactly the right search fields or options, and so post-download screening based on data or human judgements is common. It is often not clear which screening steps were performed by the registry search, which were post-download filters applied to the data set, and which were based on the judgement of human screeners. To ensure transparency and reproducibility, authors should be specifically instructed to state which is which, and to disclose the code for any post-download filters.
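As an illustration of what such reporting could look like, the sketch below applies each screening step through one function that also logs who or what performed it. Everything here is an assumption for the sake of the example: the `screen()` helper, the column names, and the toy records are all invented.

```r
# Toy records standing in for downloaded registry entries (invented values)
trials <- data.frame(
  id = c("T1", "T2", "T3", "T4", "T5"),
  phase = c("Phase 2", "Phase 3", "Phase 1", "Phase 2", "Phase 3"),
  start_year = c(2008, 2012, 2015, 2003, 2011),
  human_include = c(TRUE, TRUE, TRUE, TRUE, FALSE),  # a screener's judgement
  stringsAsFactors = FALSE
)

# Hypothetical helper: apply one screening step and record who performed it
screen <- function(data, keep, step, performed_by, screening_log) {
  kept <- data[keep, , drop = FALSE]
  entry <- data.frame(step = step, performed_by = performed_by,
                      n_before = nrow(data), n_after = nrow(kept),
                      stringsAsFactors = FALSE)
  list(data = kept, log = rbind(screening_log, entry))
}

s1 <- screen(trials, trials$phase %in% c("Phase 2", "Phase 3"),
             "Phase 2 or 3 only", "post-download code", data.frame())
s2 <- screen(s1$data, s1$data$start_year >= 2005,
             "Started 2005 or later", "post-download code", s1$log)
s3 <- screen(s2$data, s2$data$human_include,
             "Judged eligible on reading", "human screener", s2$log)
print(s3$log)
```

The resulting log is close to what a flow diagram needs: one row per step, with the counts before and after and the kind of screening that produced them.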

Extraction of clinical trial data

In a traditional systematic review of clinical trials, trial data is extracted by human readers who apply their judgement to extracting data points to be analyzed.

Reviews of clinical trials that are based on clinical trial registries often include analyses of data points that are based on machine-readable data. For example, answering the question “What is the distribution of phases among trials of sunitinib in renal cell carcinoma?” can be done in 5 lines of R code without any human judgement or curation at all. (See Code example 3.) However, there are other questions that would be difficult to answer without human interpretation, e.g. “Does the rationale for stopping this trial indicate that it was closed for futility?”

To make it more complicated, there are questions that could in principle be answered using only machine-readable information, but where the interpretation is complicated enough that it might be easier to simply have humans read the trial registry entries. E.g. “How many clinical trials recruit at least 85% of their original anticipated enrolment?” This question requires no human judgement per se. However, there is no direct way to mass-download historical versions of clinical trial registry entries without writing a web-scraper. A review that reports a result for this question may have had human readers open the history of changes and make notes, or it may be reporting the output of a fairly sophisticated piece of programming whose code should be published for scrutiny.
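Once the original anticipated enrolment has been obtained, by whatever means, the final calculation is short, which is part of why the reported number alone says so little about how it was produced. A minimal sketch in base R, using invented enrolment figures and hypothetical column names:

```r
# Toy data: original anticipated and final actual enrolment (invented values)
enrolment <- data.frame(
  id = c("T1", "T2", "T3", "T4"),
  anticipated = c(100, 200, 50, 400),
  actual = c(90, 120, 50, NA),  # NA: actual enrolment not yet reported
  stringsAsFactors = FALSE
)

# Proportion of the original target that was actually recruited
ratio <- enrolment$actual / enrolment$anticipated
met_85 <- ratio >= 0.85

# Report the denominator explicitly: trials with missing data are excluded
n_evaluable <- sum(!is.na(met_85))
n_met <- sum(met_85, na.rm = TRUE)
cat(n_met, "of", n_evaluable, "evaluable trials reached at least 85%\n")
```

The hard part, reconstructing the `anticipated` column from historical versions of each entry, is not shown here and is exactly the step whose method should be reported.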

These distinctions are often not reported, or if they are, there is not enough detail to properly assess them. Code is rarely published for scrutiny. Whether human-extracted data were single- or double-coded is also often left unclear. A result that sounds like it was calculated by taking a simple ratio of the values of two fields in a database may actually have been produced by a months-long double-coding effort, or by a piece of programming that should be made available for scrutiny.

Data that was never meant to be machine readable, but is now

There are some data points that are presented as machine readable in clinical trial registries that were never meant to be interpreted by machines alone. PRISMA assumes that all data points included in a systematic review were extracted by human curators, and so it does not address a particular class of problem that can arise when they are not.

For example, in clinical trial NCT00342927, some early versions of the trial record (e.g. 2009-09-29) give anticipated enrolment figures of “99999999”. The actual trial enrolment was 9084. The “99999999” was not a data entry error or a very bad estimate—it was a signal from the person entering data that this data point was not available. The assumption was that no one would be feeding these data points into a computer program without having them read by a human who would know not to read that number as an actual estimate of the trial’s enrolment.

This can, of course, be caught by visualizing the data, checking for outliers, doing spot-checks of data, etc., but there is currently no requirement on the PRISMA checklist to report data integrity checks.
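A data integrity check of this kind can be quite small. The sketch below, in base R, flags enrolment values made up entirely of repeated 9s and, separately, values far above the median. Only the 99999999 value comes from the example above; the other figures, the four-digit minimum, and the 100-times-median threshold are arbitrary assumptions chosen for illustration.

```r
# Anticipated enrolment values; only 99999999 comes from the text,
# the rest are invented
anticipated <- c(120, 45, 99999999, 300, 9999, 60)

# Flag values written entirely as repeated 9s (four or more digits)
is_placeholder <- grepl("^9{4,}$", sprintf("%.0f", anticipated))

# Flag values far above the median as candidates for a manual look
# (the factor of 100 is an arbitrary choice)
is_outlier <- anticipated > 100 * median(anticipated)

print(data.frame(anticipated, is_placeholder, is_outlier))
```

The two checks catch different things: in this toy data the outlier rule misses the smaller placeholder 9999, while the pattern rule catches both, so reporting which checks were run matters as much as reporting that some were.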

Matching registry entries to publications or other registry entries

Not all systematic reviews that include clinical trial registry entries are based on registry data alone. Many are hybrids that try to combine registry data with data extracted from publications. Clinical trials are also often registered in multiple registries. In order to ensure that clinical trials are not double-counted, it is necessary in some cases to match trial registry entries with publications or with entries in other registries. For this reason, any review that includes more than one trial registry should be required to report their de-duplication strategy.

Trial matching or de-duplication is a non-trivial step whose methods should be reported. Even in cases where the trial registry number is published in the abstract, this does not guarantee a one-to-one correspondence between publications and trial registry entries, as there are often secondary publications. There is also a significant body of literature that does not comply with the requirement to publish the trial’s registry number, and the decision procedure for matching these instances should be published as well.
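To show why a decision procedure is needed even in the easy case, here is a minimal sketch in base R that extracts registry numbers from abstracts and then separates out the two problem groups named above. The abstracts and publication IDs are invented, and the pattern only covers identifiers of the form "NCT" followed by eight digits; other registries use other formats.

```r
# Toy abstracts (invented text); pub_id is a hypothetical publication key
pubs <- data.frame(
  pub_id = c("P1", "P2", "P3"),
  abstract = c(
    "Primary results. Registered as NCT00342927.",
    "Secondary analysis of NCT00342927 with longer follow-up.",
    "A randomized trial with no registry number reported."
  ),
  stringsAsFactors = FALSE
)

# Pull out every identifier of the form NCT + eight digits
found <- regmatches(pubs$abstract, gregexpr("NCT[0-9]{8}", pubs$abstract))

# One row per (publication, registry number) pair
pairs <- data.frame(
  pub_id = rep(pubs$pub_id, lengths(found)),
  nct = unlist(found),
  stringsAsFactors = FALSE
)

# Registry entries with more than one publication need a stated rule
pubs_per_trial <- table(pairs$nct)
multi <- names(pubs_per_trial)[pubs_per_trial > 1]

# Publications with no identifier need a separate matching procedure
unmatched <- setdiff(pubs$pub_id, pairs$pub_id)

print(pairs)
print(multi)
print(unmatched)
```

Even this automatic step leaves two sets, `multi` and `unmatched`, that can only be resolved by rules a reader would need to see.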

PRISMA does not require that the decision procedure for matching trial registry entries to other records or publications be disclosed.

R Code examples

1. Search for all trials studying renal cell carcinoma using the web front-end

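library(dplyr)  # provides %>% and count()

temp <- tempfile()
download.file("", temp)
# List the files in the downloaded zip archive and count them
unzip(temp, list=TRUE)[1] %>%
  count()
## n
## 1744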

2. Search for all trials studying renal cell carcinoma using the API

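library(magrittr)  # provides %>%
library(xml2)      # provides read_xml(), xml_find_first(), xml_text()
read_xml("[Condition]renal+cell+carcinoma") %>%
  xml_find_first("/FullStudiesResponse/NStudiesFound") %>%
  xml_text()
## [1] "1562"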

3. Distribution of phases of clinical trials testing sunitinib in renal cell carcinoma

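library(magrittr)  # provides %>%
library(xml2)      # provides read_xml(), xml_find_all(), xml_text()

read_xml("[Condition]renal+cell+carcinoma+AND+AREA[InterventionName]sunitinib") %>%
  xml_find_all("//Field[@Name='Phase']") %>%
  xml_text() %>%
  as.factor() %>%
  summary()  # counts per phase
## Not Applicable Phase 1 Phase 2 Phase 3 Phase 4
## 5              13      53      13      3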

Published by

The Grey Literature

This is the personal blog of Benjamin Gregory Carlisle PhD. Queer; Academic; Queer academic. "I'm the research fairy, here to make your academic problems disappear!"
