Clinicaltrials.gov helpfully provides a facility for downloading machine-readable XML files of its data. Here’s an example of a zipped file of 10 clinicaltrials.gov XML files.
Unfortunately, a big zipped folder of XML files is not that helpful. Even after parsing a whole bunch of trials into a single data frame in R, there are a few fields that are written in the least useful format ever. For example, the <study_design> field usually looks something like this:
Allocation: Non-Randomized, Endpoint Classification: Safety Study, Intervention Model: Single Group Assignment, Masking: Open Label, Primary Purpose: Treatment
So, I wrote a little R script to help us all out. Do a search on clinicaltrials.gov, then save the unzipped search result in a new directory called search_result/ in your ~/Downloads/ folder. The following script parses each XML file in that directory, adds it as a row to a new data frame called “trials”, then explodes the <study_design> field into individual columns.
For example, given the field above, it would create new columns called “Allocation”, “Endpoint_Classification”, “Intervention_Model”, “Masking”, and “Primary_Purpose”, each populated with the corresponding value.
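To make the idea concrete before the full script, here is a minimal sketch of the exploding step applied to a single string. The helper name `explode_study_design` is hypothetical and is not part of the script below; it reuses the same regular expressions as the script, and the example string is the one above with a parenthesised Masking value swapped in purely for illustration.

```r
# Hypothetical helper (not part of the script): turn one <study_design>
# string into a named character vector.
explode_study_design <- function(study_design) {
  # Split on commas that are NOT inside parentheses
  pairs <- strsplit(study_design, ", *(?![^()]*\\))", perl = TRUE)[[1]]
  colon <- regexpr(":", pairs)
  # Text before the first colon is the key; clean it the way the script does
  keys <- gsub("([[:punct:]])|\\s+", "_", substr(pairs, 1, colon - 1))
  # Text after the colon (and the space following it) is the value
  values <- substr(pairs, colon + 2, nchar(pairs))
  setNames(values, keys)
}

# Illustrative input; the parenthesised Masking value is an assumption
# added to show that commas inside parentheses are left alone.
example <- paste0(
  "Allocation: Non-Randomized, Endpoint Classification: Safety Study, ",
  "Intervention Model: Single Group Assignment, ",
  "Masking: Double Blind (Subject, Investigator), Primary Purpose: Treatment"
)
exploded <- explode_study_design(example)
print(exploded)
```

Each name in the result corresponds to one of the new columns, and the comma inside the parentheses does not cause a split.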
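library(XML)   # xmlTreeParse, xmlRoot, xmlSApply, xmlValue
library(plyr)  # rbind.fill
# Change path as necessary
path <- "~/Downloads/search_result/"
setwd(path)
# The pattern is a regular expression, so escape the dot and anchor the end
xml_file_names <- dir(path, pattern = "\\.xml$")
counter <- 1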
# Makes data frame by looping through every XML file in the specified directory
for (xml_file_name in xml_file_names) {
  xmlfile <- xmlTreeParse(xml_file_name)
  xmltop <- xmlRoot(xmlfile)
  data <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
  if (counter == 1) {
    trials <- data.frame(t(data), row.names = NULL)
  } else {
    newrow <- data.frame(t(data), row.names = NULL)
    trials <- rbind.fill(trials, newrow)
  }
  # This will be good for very large sets of XML files
  print(
    paste0(
      xml_file_name,
      " processed (",
      format(100 * counter / length(xml_file_names), digits = 2),
      "% complete)"
    )
  )
  counter <- counter + 1
}
# Data frame has been constructed. Comment out the following two loops
# (until the "un-cluttering" part) in the case that you are not interested
# in exploding the <study_design> column.
columns <- character(0)
for (stu_des in trials$study_design) {
  # splits by commas NOT in parentheses
  for (pair in strsplit(stu_des, ", *(?![^()]*\\))", perl = TRUE)) {
    # everything before the first colon is the column name
    newcol <- substr(pair, 1, regexpr(':', pair) - 1)
    columns <- c(columns, newcol)
  }
}
for (newcol in unique(columns)) {
  # get rid of spaces and special characters
  newcol <- gsub('([[:punct:]])|\\s+', '_', newcol)
  if (newcol != "") {
    # add the new column
    trials[, newcol] <- NA
    i <- 1
    for (stu_des2 in trials$study_design) {
      for (pairs in strsplit(stu_des2, ", *(?![^()]*\\))", perl = TRUE)) {
        for (pair in pairs) {
          if (gsub('([[:punct:]])|\\s+', '_', substr(pair, 1, regexpr(':', pair) - 1)) == newcol) {
            # write by column name so the value cannot land in the wrong column
            trials[i, newcol] <- substr(pair, regexpr(':', pair) + 2, nchar(pair))
          }
        }
      }
      i <- i + 1
    }
  }
}
# Un-clutter the working environment
rm(
  i, counter, data, newcol, newrow, columns, pair, pairs,
  stu_des, stu_des2, xml_file_name, xml_file_names, xmlfile, xmltop
)
# Get nice NCT IDs
get_nct_id <- function(row_id_info) {
  return(unlist(row_id_info)["nct_id"])
}
trials$nct_id <- lapply(trials$id_info, get_nct_id)
# Clean up enrolment field
trials$enrollment[trials$enrollment == "NULL"] <- NA
trials$enrollment <- as.numeric(trials$enrollment)
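Because `lapply` returns a list, the script leaves `nct_id` as a list column. If plain character and numeric columns are more convenient, a variation along these lines may help. This is only a sketch on a made-up stand-in data frame called `trials_demo` (all IDs and enrollment values are invented for illustration), and it assumes every trial carries exactly one `nct_id`.

```r
# Toy stand-in for the parsed data; every value here is invented
trials_demo <- data.frame(row = 1:3)
trials_demo$id_info <- list(
  c(org_study_id = "ABC-1", nct_id = "NCT00000001"),
  c(org_study_id = "ABC-2", nct_id = "NCT00000002"),
  c(org_study_id = "ABC-3", nct_id = "NCT00000003")
)
trials_demo$enrollment <- list("120", "NULL", "45")

# vapply() returns a character vector instead of a list, and fails loudly
# if any trial does not yield exactly one nct_id
trials_demo$nct_id <- vapply(
  trials_demo$id_info,
  function(x) unlist(x)[["nct_id"]],
  character(1)
)

# Flatten enrollment to character, blank out the "NULL" placeholder,
# then convert to numeric
enrollment_chr <- vapply(
  trials_demo$enrollment,
  function(x) as.character(x)[1],
  character(1)
)
enrollment_chr[enrollment_chr == "NULL"] <- NA
trials_demo$enrollment <- as.numeric(enrollment_chr)

str(trials_demo[, c("nct_id", "enrollment")])
```

The trade-off is strictness: `vapply` stops with an error on a trial that lacks an `nct_id`, whereas the `lapply` version in the script carries on without complaint.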
Useful references:
- https://www.r-bloggers.com/r-and-the-web-for-beginners-part-ii-xml-in-r/
- http://stackoverflow.com/questions/3402371/combine-two-data-frames-by-rows-rbind-when-they-have-different-sets-of-columns
- http://stackoverflow.com/questions/21105360/regex-find-comma-not-inside-quotes
