Introducing PubMed NCT Extractor

Inspired by talks at the 2019 METAxDATA un-conference, I wrote a little meta-research tool to batch extract ClinicalTrials.gov (“NCT”) numbers from PubMed XML search results and test whether they correspond to legitimate entries.

It’s written for Python 3 on elementary OS 5.1, so I can’t guarantee it will work on anything else. I also wrote a paper based on this tool that I’m currently sending to journals for review. If you’re interested in reading a draft, let me know and I’ll be happy to share it with you.
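To give a rough idea of the extraction step, here is a minimal sketch. It is not the code from the tool itself, and the function name and sample text are made up for illustration. It only assumes that an NCT number is the letters “NCT” followed by eight digits, and it does not check whether a number corresponds to a real registry entry.

```python
import re

# An NCT number is "NCT" followed by exactly eight digits
NCT_PATTERN = re.compile(r"\bNCT\d{8}\b")

def extract_nct_numbers(xml_text):
    """Return the unique NCT numbers in xml_text, in order of first appearance."""
    found = []
    for match in NCT_PATTERN.findall(xml_text):
        if match not in found:
            found.append(match)
    return found

# Hypothetical fragment of a PubMed XML search result
sample = (
    "<AbstractText>Registered as NCT01234567 and NCT07654321; "
    "see NCT01234567 for details.</AbstractText>"
)

print(extract_nct_numbers(sample))  # ['NCT01234567', 'NCT07654321']
```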

You can get the code from Codeberg! If you try it out or use it for something, let me know!

The Stradivarielzebub: making an electric fiddle

On October 7, 2018, I was looking at electric fiddles, as is my habit from time to time. If you’ve never seen an electric violin before, you should check them out. They’re often shaped differently from acoustic violins and they are strange and wonderful. My favourite ones are the ones that look like skeletons of a violin. On a whim, it occurred to me that an electric fiddle that was gold-coloured would be a fun Devil Went Down to Georgia reference. So I started looking around and I couldn’t find one. Maybe I’m bad at online shopping, but as far as I could tell, they were just not for sale.

My current fiddle teacher has an electric fiddle that’s made of plexiglass, so it’s see-through, with multi-colour LEDs. I suppose I was partly inspired by that and by the lack of gold fiddles when I started looking into how hard it would be to make one myself. It turns out that it is very difficult to make a violin of any kind, but it is much less difficult to make an electric violin than an acoustic one.


I started by drawing out some concept sketches for the fiddle. If I was going to make the fiddle of the Devil himself, it would need some artistic flourishes along those lines.

Then I took some measurements from my acoustic fiddle, did some research, made a few assumptions, and drew a more specific plan.

Making rough cuts

My boyfriend and I went to the hardware store, picked out the wood and went to his father’s garage, to borrow his tools. I did the measuring and the drawing. My boyfriend operated the machines. His brother cut a small piece of metal for us to use to brace the strings at the back of the scroll.

It makes me really happy that this is a project I got to do together with my boyfriend and his family.

We had a couple false starts, but we’re starting to get the hang of it now!


Sanding is one of those things that’s terrible to have to do, but very satisfying to take before and after pictures of.

Staining and gluing the fingerboard on

Burning “Homo fuge” on the back

In Marlowe’s Doctor Faustus, after Faustus cuts his arm to get the blood to sign the contract with Mephistopheles, his arm is miraculously healed, and the words “Homo fuge” (Latin for “fly, O man”) appear there as a warning from heaven for him to get out of that situation. So it seemed appropriate to burn that into the back of the neck of the Stradivarielzebub.

Gluing the pieces together

Applying varnish and tuning pegs

The (almost) final product

On October 7, 2019, we actually strung the fiddle and plugged it into an amp for the first time.

But does it play?

What’s next?

So there’s a couple things that are left. I want to gild the tail and scroll with gold leaf, so that it can properly be referred to as a “shiny fiddle made of gold” as per The Devil Went Down to Georgia, and there’s a couple small adjustments that I’d like to make so that it’s a bit more playable. Even today, a day later, I’ve taken the strings off again to fix it up, and there’s a few things I want to do to make it better.

And I’ve already got a sketch for what to do for the next violin!

Update: 2019-10-11 (gold leaf; minor adjustments)

I took the fiddle apart for a couple days, made the adjustments I meant to, and put gold leaf on the Devil’s horns and tail. The Devil, as they say, is in the details.

Putting gold leaf on something is terrible to do. It’s like working with tin foil, but a tin foil that’s so thin, if you breathe on it too hard, it’ll rip.

And here’s the final result!

Can you predict the outcome of Canada’s 43rd federal election?

In 2015, I collected predictions for the outcome of the federal election. It was a semi-scientific, just-for-fun endeavor, but one that I want to try again. When I was done, I analyzed the predictions and wrote up a little report, available on my blog.

This election is looking to be a close call (again), so I’m collecting predictions just for fun. Tell your friends!

I’m offering a beer to whoever (over drinking age) makes the best prediction.

Here’s the link:

You can make as many predictions as you like (but please give me an email address if you make more than one so I can combine yours together in analysis). The page will stop accepting new predictions when the last poll closes in Newfoundland on 2019 October 21.

Getting fountain-mode in Emacs working with imenu-list-mode and writeroom-mode

So I was playing around in Emacs recently [insert obligatory joke here] and saw a working environment that looked kinda like the following:

Emacs fountain-mode with olivetti-mode and imenu-list-minor-mode

I wanted to try it out, but some of the documentation for how to do so was missing. I did some searching around and contacted the people who run that site and was able to get my system to look the same, at least in all the ways that were relevant to me.

So, I wrote it all down, mostly for my own reference later, but maybe you will find it useful too.

My setup: elementary OS v 5.0 (based on Ubuntu 18.04) and Emacs v 25.2.1.

I’m going to assume that the reader knows how to use Emacs as a regular text editor already. There’s a tutorial built into Emacs that is better than what I would write anyway, so if this is your first time using this text editor, I recommend going through it.

Fountain is a markup language for screenwriting. Similar to Markdown, it is unformatted text that many common text editors can parse to apply a simple set of formatting rules. This allows one to focus on writing the story, rather than the details of making sure that Microsoft Word or whatever is correctly applying your stylesheets.

Installing Emacs packages

From inside Emacs, type:

M-x package-install RET fountain-mode RET

This will install the major mode for editing .fountain files. Now, if you open an example .fountain file, it should be parsed with the appropriate markup rules and have syntax-appropriate highlighting and formatting applied.

At this point, you can also already export the .fountain file as a file to be typeset by LaTeX into a PDF, as an HTML file to publish online, or in a couple of other formats.

M-x package-install RET olivetti RET

This command will install olivetti-mode, a minor mode that lets you constrain the width of the column in which you’re typing, regardless of the width of the window that contains it. This also eliminates distractions in your writing environment, as the number of characters you see across a line of text stays consistent. It also reduces eye strain, because you can now maximize the window to cover your entire screen without having to follow a line of text that is hundreds of characters long.

M-x package-install RET imenu-list RET

This command installs imenu-list, a plugin that creates an automatically updated index for an Emacs buffer. This will help with navigating your document while you’re editing it.

M-x package-install RET writeroom-mode RET

This command installs a minor mode that makes Emacs go full-screen, and provides a “distraction-free” editing environment.

Installing Courier

Download Courier Prime Emacs, and install on your computer.

To make this work with fountain-mode, go to: Fountain > Customize Faces from the menu bar.

Fountain > Customize Faces in Emacs

This will bring up a page where you can customize the formatting applied in fountain-mode. Scroll down to the first “Fountain face / Default base-level face for ‘fountain-mode’ buffers.” Click on the disclosure triangle to the left of where it says “Fountain” and then click on “Show All Attributes.”

Click the “Font Family” box and then type “Courier Prime Emacs,” then click the “Height” box, change the units to 1/10 pt and enter “120.”

Now click the “State” button and “Save for future sessions.”

When you open fountain-mode, the font will now be 12pt Courier!

Turning these modes on automatically

Right now, fountain-mode should already start automatically when you open a .fountain file, but we want to have olivetti-mode too, as well as an imenu buffer for ease of navigation.

To have fountain-mode start other minor modes when it starts, go to: Fountain > Customize Mode from the menu bar.

Fountain > Customize Mode in Emacs

This will bring up an editor with a lot of options. Scroll down until you find “Fountain Mode Hook / Hook run after entering Fountain mode,” then click the disclosure triangle.

You can uncheck turn-on-visual-line-mode, because this is done by olivetti-mode.

Click the “INS” button and type “olivetti-mode” in the grey text box.

Click “INS” again and type “imenu-list-minor-mode” in the grey text box.

Click “State” and then “Save for future sessions.”

When you’re done editing it, it should look like the following:

Fountain Mode Hook editor

Now when you open a .fountain file, olivetti-mode and imenu-list-minor-mode will start.

I did not tell it to start writeroom-mode automatically, because I personally don’t always want that. If you want to start writeroom-mode, type the following command:

M-x writeroom-mode RET

This will make the current buffer full-screen, and it will hide the imenu buffer. If you want to see the navigation again, the following commands will work:

C-x 3
C-x o
C-x b

The first command listed above splits the window into two side-by-side windows, the second moves the focus to the new window (optional), and the third prompts you to choose which buffer to switch to.

Type *Ilist* to select the buffer with the scenes and sections menu. Drag the vertical column divider to your liking and enjoy your distraction-free workspace!
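If you’re curious what a “scenes and sections” outline amounts to, here is a rough Python sketch of the idea. This is not how fountain-mode or imenu-list actually build their index. It only assumes two simplified Fountain conventions: section lines start with one or more #’s, and scene headings start with INT, EXT, EST or I/E. The function name and the sample screenplay are made up.

```python
import re

# Simplified scene heading test: INT, EXT, EST or I/E followed by ".", "/" or a space
SCENE_HEADING = re.compile(r"^(INT|EXT|EST|I/E)[./ ]", re.IGNORECASE)

def outline(fountain_text):
    """Return (kind, level, text) tuples for the sections and scene headings."""
    entries = []
    for line in fountain_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            title = stripped.lstrip("#")
            level = len(stripped) - len(title)  # number of leading #'s
            entries.append(("section", level, title.strip()))
        elif SCENE_HEADING.match(stripped):
            entries.append(("scene", 0, stripped))
    return entries

# Hypothetical screenplay fragment
sample = """# Act One

INT. TAILOR SHOP - DAY

GARAK
Good morning.

EXT. PROMENADE - NIGHT
"""

for entry in outline(sample):
    print(entry)
```

Character names and dialogue fall through both tests, so only the lines you would want to jump to end up in the list.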

But wait, I want to export to PDF

To export your screenplay, go to Fountain > Export > Buffer to LaTeX.

If your .fountain file is named screenplay.fountain, then it will make a new file in the same directory named screenplay.tex.

LaTeX is not a PDF, but it’s a step in the right direction.

Open the terminal and type the following:

$ sudo apt install texlive-xetex

It will ask for your admin password, and then start installing some software. Once it finishes (it may take a while), in the terminal, navigate to the directory where your .tex file is saved and type the following:

$ xelatex screenplay.tex

This will produce a new file called screenplay.pdf, which will be your printable, final output. Note that section headings (if you used them in your .fountain file, they would have started with #’s) and comments are not included in the final PDF. They are for the writer’s reference only.

Here’s an R function that you can use to download clinical trial data

Sometimes you need to look at a lot of sets of clinical trial data, and you don’t want to go to ClinicalTrials.gov and run the searches manually, then save the files manually, then load them into R or Excel manually to get your work done.


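library(readr)

get_ct_dot_gov_data_for_drugname <- function (drugname) {

   temp <- tempfile()

   # The empty string stands for the search URL, which is not given here
   download.file(paste0("", URLencode(drugname, reserved = TRUE), "&flds=a&flds=b&flds=y"), temp)

   trial_data <- read_delim(
     temp,
     delim = "\t", # assumption: the downloaded file is tab-delimited
     escape_double = FALSE,
     trim_ws = TRUE
   )

   unlink(temp) # delete the temporary file

   return (trial_data)

}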


So here’s a function that you can use to download all the trials for a given drug name, and it returns a data frame with the trial metadata.


How to train a Markov chain based on any Star Trek character

Requirements: Linux desktop computer with Python 3

There is a wonderful site that has transcripts for every Star Trek episode ever at (Thank you!) This will be the data source that we will be using for this tutorial. And if you appreciate the work that went into transcribing, there’s a donation button on their site.

In order to reduce the amount of strain that I’m putting on their server, I made a local copy of all their transcripts by scraping their site using the following in the command line:

$ wget -r -np

This step only has to be done once, and now that the files are saved locally, we don’t have to keep hitting their server with requests for transcripts. This will get you all the transcripts for DS9, but you could also navigate to, say, the page for TNG and do the same there if you were so inclined.

This produces a directory full of numbered HTML files (401.htm to 575.htm, in the case of DS9) and some other files (episodes.htm and robots.txt) that can be safely discarded.

Make a new directory for your work. I keep my projects in ~/Software/, and this one in particular I put in ~/Software/extract-lines/ but you can keep it wherever. Make a folder called scripts inside extract-lines and fill it with the numbered HTML files you downloaded previously.

Make a new file called with the following Python code inside it:

# Provide the character name you wish to extract as an argument to this script
# Must be upper case (e.g. "GARAK" not "Garak")
# For example:
# $ python3 GARAK

import sys
import os

from html.parser import HTMLParser

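class MLStripper(HTMLParser):
    """Collects only the text content of the HTML it is fed."""
    def __init__(self):
        super().__init__()
        self.strict = False
        self.convert_charrefs = True
        self.fed = []
    def handle_data(self, d):
        # Called by the parser for each run of text between tags
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()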

corpus_file = open(str(sys.argv[1]) + ".txt", "a")

for file_name in os.listdir ("scripts/"):
    script_file = open ("scripts/" + file_name, "r")
    script_lines = script_file.readlines()

    line_count = 0

    for script_line in script_lines:
        extracted_line = ""
        if script_line.startswith(str(sys.argv[1])):
            extracted_line += strip_tags(script_line[len(str(sys.argv[1]))+1:])
            if "<br>" not in script_line and "</font>" not in script_line:
                more_lines = ""
                more_lines_counter = 1
                while "<br>" not in more_lines and "</font>" not in more_lines:
                    more_lines = strip_tags(more_lines) + script_lines[line_count + more_lines_counter]
                    more_lines_counter += 1
                extracted_line += strip_tags(more_lines)
                extracted_line = extracted_line.replace("\n", " ")
            corpus_file.write(extracted_line.strip() + "\n")
        line_count += 1


Back in the command line, go to the extract-lines/ folder, and run the following command:

$ python3 GARAK

This will make a text file called GARAK.txt in the extract-lines/ folder that contains every line spoken by Garak.

Do that for every character whose lines you want to extract. You’ll end up with a bunch of .txt files that you can copy into a new project.

Now, make a new folder. I put mine in ~/Software/more_ds9/.

You’ll need to make a Python virtual environment because whoever invented Python hates you. Run the following in your terminal and don’t think too much about it:

$ cd ~/Software/more_ds9/
$ python3 -m venv env
$ source env/bin/activate
$ pip install markovify

Okay I guess I should explain. What you’ve done is created a little mini-Python installation inside your system’s big Python installation, so that you can install packages just for this project without them affecting anything else. To access this, in the terminal you ran $ source env/bin/activate, and if you want to run your Python code later and have it work, you have to do that first every time. When you’re done with it, just type $ deactivate.

Make a new file in your project directory called with the following Python code in it:

# Usage example:
# $ python3 GARAK

import sys
import markovify

# Read the whole corpus for the requested character
with open("corpuses/" + str(sys.argv[1]) + ".txt") as corpus_file:
    corpus_text = corpus_file.read()

# Build the model
text_model = markovify.Text(corpus_text)

# Generate a sentence
print(str(sys.argv[1]) + ": " + str(text_model.make_sentence()))

Make a new directory called corpuses inside more_ds9 and copy into it all the text files that you generated in your extract-lines project above.

Go to your command line and type the following:

$ python3 GARAK

It should give you some output like:

GARAK: This time, Intendant, I trust the source, but rest assured I will confirm the rod's authenticity before I say I am.

If you change “GARAK” to any other character whose lines you extracted in the previous project, you will get an output generated by that character. Now you have the tools and data sources to make a Markov chain for any character in any Star Trek series you like!
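If you’re wondering what a Markov chain is actually doing with those lines, here is a bare-bones sketch in plain Python. This is not how markovify is implemented, and markovify does considerably more than this. The function names, the marker strings and the three sample lines are all made up for illustration. The idea is just to record which word follows which, then walk that table at random.

```python
import random
from collections import defaultdict

START, END = "<START>", "<END>"

def build_chain(lines):
    """Map each word to the list of words that followed it in the corpus."""
    chain = defaultdict(list)
    for line in lines:
        words = [START] + line.split() + [END]
        for current_word, next_word in zip(words, words[1:]):
            chain[current_word].append(next_word)
    return chain

def generate(chain, rng, max_words=30):
    """Walk the chain from START until END (or max_words) and join the words."""
    word = START
    output = []
    for _ in range(max_words):
        word = rng.choice(chain[word])
        if word == END:
            break
        output.append(word)
    return " ".join(output)

# Hypothetical corpus, one spoken line per entry
lines = [
    "I am a plain simple tailor",
    "I am not a spy",
    "a tailor hears many things",
]

chain = build_chain(lines)
print(generate(chain, random.Random()))
```

Words that follow the same word more often appear more often in its list, so they get picked more often. That is all the “training” there is in this toy version.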

And if you don’t want to bother with Python and all that, I took this method and built a fedibot that posts “new” Deep Space Nine dialogue using this method once per hour, which you can find here:

Most pediatric approval documents are filed under the wrong date in the Drugs@FDA data files


Research ethics and meta-research depend on reliable data from regulatory bodies such as the FDA. These provide information that is used to evaluate drugs, clinical trials of new therapies, and even entire research programmes. Transparency in regulatory decisions and documentation is also a necessary part of a modern health information economy.

The FDA publishes several data sets to further these ends, including the Drugs@FDA data set, and the Postmarketing Commitments Data set. The Drugs@FDA data set contains information regarding drug products that are regulated by the FDA, submissions to the FDA regarding these products and related application documents, their meta-data and links to the documents themselves.

Errors in these data files may invalidate other meta-research on drug development, and threaten the trust we have in regulatory institutions.


The Drugs@FDA data file was downloaded from the following address, as specified in the R code below:

The version dated 2019 July 16 has been saved to this blog at the following address for future reference, in case the link is changed later on, or the issue reported in the following is addressed:

The following code was run in R:


library(readr)
library(ggplot2)

# Download files from FDA

temp <- tempfile()
download.file("", temp)

# Import Application Docs

ApplicationDocs <- read_delim(
   unz(temp, "ApplicationDocs.txt"),
   delim = "\t", # the Drugs@FDA text files are tab-delimited
   escape_double = FALSE,
   col_types = cols(
     ApplicationDocsDate = col_date(format = "%Y-%m-%d 00:00:00"),
     ApplicationDocsID = col_integer(),
     ApplicationDocsTitle = col_character(),
     ApplicationDocsTypeID = col_integer(),
     SubmissionNo = col_integer()
   ),
   trim_ws = TRUE
)

# Plot Application Docs histogram (Figure 1)

ggplot(
   aes(
     x = ApplicationDocsDate
   ),
   data = ApplicationDocs
) + geom_histogram(
   binwidth = 365.25
) + labs (
   title = "Histogram of Drugs@FDA application document dates",
   x = "Application document date",
   y = "Number of documents"
)

# Import Application Docs Types

ApplicationsDocsType_Lookup <- read_delim(
   unz(temp, "ApplicationsDocsType_Lookup.txt"),
   delim = "\t",
   escape_double = FALSE,
   col_types = cols(
     ApplicationDocsType_Lookup_ID = col_integer()
   ),
   trim_ws = TRUE
)

# Delete the downloaded files, as they're no longer necessary

unlink(temp)

# Merge Application Docs information with Document Types

Application_Docs_With_Types <- merge(
   ApplicationDocs,
   ApplicationsDocsType_Lookup,
   by.x = "ApplicationDocsTypeID",
   by.y = "ApplicationDocsType_Lookup_ID"
)

# Restrict to pediatric only

Pediatric_Docs <- subset(
   Application_Docs_With_Types,
   grepl(
     "pediatric",
     ApplicationDocsType_Lookup_Description,
     ignore.case = TRUE
   )
)

# Plot Pediatric Application Docs histogram (Figure 2)

ggplot(
   aes(
     x = ApplicationDocsDate
   ),
   data = Pediatric_Docs
) + geom_histogram(
   binwidth = 365.25
) + labs (
   title = "Histogram of Drugs@FDA application document dates (pediatric only)",
   x = "Application document date",
   y = "Number of documents"
)

These data were analyzed using R version 3.6.1 (2019-07-05)¹ and plotted using the ggplot2 package.²


There are a total of 57,495 application documents published in the Drugs@FDA data files, with dates ranging from 1900-01-01 to 2019-07-16, the date the data set was published (see Figure 1).

Figure 1. Number of FDA application documents published over time

The histogram shows a spike of 1404 application documents at the year 1900, followed by an absence of FDA application documents between 1900 and 1955. There is a steady increase in the number of application documents starting in the 1990s until the present day. All of the application documents that comprise that spike in the year 1900 are dated exactly 1900-01-01.

These 1404 application documents dated 1900-01-01 all have an application document type that includes the term “pediatric” (“Pediatric Addendum,” “Pediatric Amendment,” “Pediatric CDTL Review,” “Pediatric Clinical Pharmacology Addendum,” etc.).
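For anyone who would rather check for this kind of spike without R, here is a minimal Python sketch of the same idea: count documents per year and see whether a sentinel year such as 1900 stands out. The three rows are invented for illustration, and only the column names come from the Drugs@FDA file described above. The function name is hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical three-row sample in the tab-delimited layout of ApplicationDocs.txt
sample = (
    "ApplicationDocsID\tApplicationDocsTypeID\tApplicationDocsDate\n"
    "1\t10\t1900-01-01 00:00:00\n"
    "2\t10\t1900-01-01 00:00:00\n"
    "3\t2\t2015-06-30 00:00:00\n"
)

def count_docs_by_year(tsv_text):
    """Count rows per year, reading the year from the ApplicationDocsDate column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(int(row["ApplicationDocsDate"][:4]) for row in reader)

counts = count_docs_by_year(sample)
print(counts[1900], counts[2015])  # 2 1
```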

Among the 57,495 published Drugs@FDA application documents, there are a total of 1666 documents whose application document type includes the term “pediatric,” only 262 of which are dated after 1900-01-01 (see Figure 2).

Figure 2. Number of FDA application documents with a pediatric document type published over time


These data suggest that most of the FDA application documents that have pediatric document types have an inaccurate date: 1404 distinct documents, or 84% of pediatric application documents and 2% of all documents published in the Drugs@FDA data files.
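The proportions quoted here follow directly from the counts reported above. As a quick arithmetic check, using only those three numbers (the variable names are mine):

```python
total_docs = 57495      # all application documents in the data files
pediatric_docs = 1666   # documents with a pediatric document type
dated_1900 = 1404       # pediatric documents dated 1900-01-01

# Pediatric documents with a date after 1900-01-01
print(pediatric_docs - dated_1900)               # 262

# Share of pediatric documents dated 1900-01-01, in percent
print(round(100 * dated_1900 / pediatric_docs))  # 84

# Share of all application documents dated 1900-01-01, in percent
print(round(100 * dated_1900 / total_docs))      # 2
```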

This may have arisen from a data entry error in which unknown dates were marked with “00” and that was interpreted by the FDA software as “1900.” These errors may have gone unnoticed because the website that interprets the Drugs@FDA data set does not display dates for individual documents, although these are reported in the downloadable data file. These errors become apparent when FDA data are included in other software, such as Clinical trials viewer.³

The potential errors reported here can be corrected by manually extracting the dates from the linked PDF documents and entering them in the Drugs@FDA database back-end.

Swiftly correcting errors can help maintain trust in regulatory institutions’ databases, help ensure the quality of meta-research, aid in research ethics, and provide transparency.


  1. R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL
  2. H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
  3. Carlisle BG. Clinical trials viewer [Internet]. Retrieved from The Grey Literature; 2019. Available from: