Research ethics and meta-research depend on reliable data from regulatory bodies such as the FDA. These bodies provide information that is used to evaluate drugs, clinical trials of new therapies, and even entire research programmes. Transparency in regulatory decisions and documentation is also a necessary part of a modern health information economy.
The FDA publishes several data sets to further these ends, including the Drugs@FDA data set and the Postmarketing Commitments data set. The Drugs@FDA data set contains information on drug products regulated by the FDA, submissions to the FDA regarding these products, and related application documents, including their metadata and links to the documents themselves.
Errors in these data files may invalidate other meta-research on drug development, and threaten the trust we have in regulatory institutions.
The Drugs@FDA data file was downloaded from the following address, as specified in the R code below:
The version dated 2019 July 16 has been saved to this blog at the following address for future reference, in case the link is changed later on, or the issue reported in the following is addressed:
The following code was run in R:
library(readr)
library(ggplot2)

# Download files from FDA
temp <- tempfile()
download.file("https://www.fda.gov/media/89850/download", temp)

# Import Application Docs
ApplicationDocs <- read_delim(
  unz(temp, "ApplicationDocs.txt"),
  "\t",
  escape_double = FALSE,
  col_types = cols(
    ApplicationDocsDate = col_date(format = "%Y-%m-%d 00:00:00"),
    ApplicationDocsID = col_integer(),
    ApplicationDocsTitle = col_character(),
    ApplicationDocsTypeID = col_integer(),
    SubmissionNo = col_integer()
  ),
  trim_ws = TRUE
)

# Plot Application Docs histogram (Figure 1)
png("~/Downloads/app-docs.png", 600, 400)
ggplot(aes(x = ApplicationDocsDate), data = ApplicationDocs) +
  geom_histogram(binwidth = 365.25) +
  labs(
    title = "Histogram of Drugs@FDA application document dates",
    x = "Application document date",
    y = "Number of documents"
  )
dev.off()

# Import Application Docs Types
ApplicationsDocsType_Lookup <- read_delim(
  unz(temp, "ApplicationsDocsType_Lookup.txt"),
  "\t",
  escape_double = FALSE,
  col_types = cols(
    ApplicationDocsType_Lookup_ID = col_integer()
  ),
  trim_ws = TRUE
)

# Delete the downloaded files, as they're no longer necessary
unlink(temp)

# Merge Application Docs information with Document Types
Application_Docs_With_Types <- merge(
  ApplicationDocs,
  ApplicationsDocsType_Lookup,
  by.x = "ApplicationDocsTypeID",
  by.y = "ApplicationDocsType_Lookup_ID"
)

# Restrict to pediatric only
Pediatric_Docs <- subset(
  Application_Docs_With_Types,
  grepl(
    "pediatric",
    ApplicationDocsType_Lookup_Description,
    ignore.case = TRUE
  )
)

# Plot Pediatric Application Docs histogram (Figure 2)
png("~/Downloads/ped-docs.png", 600, 400)
ggplot(aes(x = ApplicationDocsDate), data = Pediatric_Docs) +
  geom_histogram(binwidth = 365.25) +
  labs(
    title = "Histogram of Drugs@FDA application document dates (pediatric only)",
    x = "Application document date",
    y = "Number of documents"
  )
dev.off()
These data were analyzed using R version 3.6.1 (2019-07-05)¹ and plotted using the ggplot2 package.²
There are a total of 57,495 application documents published in the Drugs@FDA data files, with dates ranging from 1900-01-01 to 2019-07-16, the date the data set was published (see Figure 1).
The histogram shows a spike of 1404 application documents at the year 1900, followed by an absence of FDA application documents between 1900 and 1955. The number of application documents then increases steadily from the 1990s until the present day. All of the application documents that make up the spike in the year 1900 are dated exactly 1900-01-01.
These 1404 application documents dated 1900-01-01 all have an application document type that includes the term “pediatric” (“Pediatric Addendum,” “Pediatric Amendment,” “Pediatric CDTL Review,” “Pediatric Clinical Pharmacology Addendum,” etc.).
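The counts reported above can be checked with a few lines of base R. The sketch below is only an illustration: it uses a small invented data frame in place of the real merged table, and the helper name `summarise_placeholder_dates` is hypothetical. The column names follow those used in the R code earlier in this post.

```r
# Toy stand-in for Application_Docs_With_Types. These rows are invented
# for illustration and are NOT taken from the Drugs@FDA data file.
toy_docs <- data.frame(
  ApplicationDocsDate = as.Date(c(
    "1900-01-01", "1900-01-01", "2005-03-14", "2012-09-30", "1998-06-02"
  )),
  ApplicationDocsType_Lookup_Description = c(
    "Pediatric Addendum", "Pediatric Amendment", "Pediatric CDTL Review",
    "Letter", "Label"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count documents carrying the 1900-01-01 date and
# check whether every one of them has a pediatric document type.
summarise_placeholder_dates <- function(docs,
                                        placeholder = as.Date("1900-01-01")) {
  is_placeholder <- !is.na(docs$ApplicationDocsDate) &
    docs$ApplicationDocsDate == placeholder
  is_pediatric <- grepl(
    "pediatric",
    docs$ApplicationDocsType_Lookup_Description,
    ignore.case = TRUE
  )
  list(
    n_total = nrow(docs),
    n_placeholder = sum(is_placeholder),
    n_pediatric = sum(is_pediatric),
    n_pediatric_placeholder = sum(is_placeholder & is_pediatric),
    all_placeholder_pediatric = all(is_pediatric[is_placeholder])
  )
}

toy_summary <- summarise_placeholder_dates(toy_docs)
print(toy_summary)
```

Passing the real `Application_Docs_With_Types` table to the same function should reproduce the figures given in the text, provided the downloaded data file has not changed since 2019-07-16.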
Among the 57,495 published Drugs@FDA application documents, there are a total of 1666 documents whose application document type includes the term “pediatric,” only 262 of which are dated after 1900-01-01 (see Figure 2).
These data suggest that most of the FDA application documents with pediatric document types have an inaccurate date: 1404 distinct documents, or 84% of pediatric application documents and 2% of all application documents published in the Drugs@FDA data files.
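The percentages follow directly from the three counts reported in this post. As a quick arithmetic check (the variable names are chosen here for illustration):

```r
# Counts as reported in the text
n_all        <- 57495L  # all application documents
n_pediatric  <- 1666L   # documents with a pediatric document type
n_dated_1900 <- 1404L   # documents dated exactly 1900-01-01

# Pediatric documents with a later date
n_pediatric_later <- n_pediatric - n_dated_1900

# Share of pediatric documents, and of all documents, dated 1900-01-01
pct_of_pediatric <- 100 * n_dated_1900 / n_pediatric
pct_of_all       <- 100 * n_dated_1900 / n_all

cat(sprintf("Pediatric documents dated after 1900-01-01: %d\n", n_pediatric_later))
cat(sprintf("Share of pediatric documents: %.1f%%\n", pct_of_pediatric))
cat(sprintf("Share of all documents: %.1f%%\n", pct_of_all))
```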
This may have arisen from a data entry error in which unknown dates were marked with “00”, which the FDA software then interpreted as “1900.” These errors may have gone unnoticed because the website that interprets the Drugs@FDA data set does not display dates for individual documents, although these dates are reported in the downloadable data file. The errors become apparent when FDA data are included in other software, such as Clinical trials viewer.³
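To make the hypothesis concrete, here is one mechanism by which a placeholder value can surface as 1900-01-01. It is purely an illustration and not a claim about how the FDA's software works: the 1900-01-01 origin is an assumption, and the dates in `toy_dates` are invented. The second half sketches how downstream software might treat the placeholder as missing rather than as a real date.

```r
# Illustration only: a numeric zero read as "days since 1900-01-01".
# The origin is an assumption, not a documented property of any FDA system.
zero_date <- as.Date(0, origin = "1900-01-01")
print(zero_date)

# Downstream guard: treat the placeholder date as missing before plotting
# or analysis. These dates are invented for the example.
toy_dates <- as.Date(c("1900-01-01", "2004-05-17", "1900-01-01", "2018-11-02"))
placeholder <- as.Date("1900-01-01")

cleaned <- toy_dates
cleaned[which(cleaned == placeholder)] <- NA

print(cleaned)
print(range(cleaned, na.rm = TRUE))
```

Setting the placeholder to `NA` keeps the affected documents in the table while preventing them from distorting date ranges or histograms.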
The potential errors reported here can be corrected by manually extracting the dates from the linked PDF documents and entering them in the Drugs@FDA database back-end.
Swiftly correcting errors can help maintain trust in regulatory institutions’ databases, help ensure the quality of meta-research, aid research ethics, and provide transparency.
1. R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
2. H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
3. Carlisle BG. Clinical trials viewer [Internet]. The Grey Literature; 2019. Available from: https://trials.bgcarlisle.com/