How to train a Markov chain based on any Star Trek character

Requirements: Linux desktop computer with Python 3

There is a wonderful site that has transcripts for every Star Trek episode ever at chakoteya.net. (Thank you!) This will be the data source that we will be using for this tutorial. And if you appreciate the work that went into transcribing, there’s a donation button on their site.

In order to reduce the amount of strain that I’m putting on their server, I made a local copy of all their transcripts by scraping their site using the following in the command line:

$ wget -r -np http://chakoteya.net/DS9/episodes.htm

This step only has to be done once; now that the files are saved locally, we don’t have to keep hitting their server with requests for transcripts. This will get you all the transcripts for DS9, but you could also navigate to, say, the page for TNG and do the same there if you were so inclined.

This produces a directory full of numbered HTML files (401.htm to 575.htm, in the case of DS9) and some other files (episodes.htm and robots.txt) that can be safely discarded.

Make a new directory for your work. I keep my projects in ~/Software/, and this one in particular I put in ~/Software/extract-lines/ but you can keep it wherever. Make a folder called scripts inside extract-lines and fill it with the numbered HTML files you downloaded previously.

Make a new file called extract.py with the following Python code inside it:

# Provide the character name you wish to extract as an argument to this script.
# It must be upper case (e.g. "GARAK", not "Garak")
# For example:
# $ python3 extract.py GARAK

import sys
import os

from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

character = sys.argv[1]

# Open in "w" mode so that re-running the script doesn't append duplicate lines
corpus_file = open(character + ".txt", "w")

for file_name in os.listdir("scripts/"):
    with open("scripts/" + file_name, "r") as script_file:
        script_lines = script_file.readlines()

    line_count = 0

    for script_line in script_lines:
        extracted_line = ""
        if script_line.startswith(character):
            # Skip past the character's name and the colon that follows it
            extracted_line += strip_tags(script_line[len(character) + 1:])
            # If this line has no <br> or </font>, the speech continues on the
            # following lines; keep collecting until we hit one (or run out)
            if "<br>" not in script_line and "</font>" not in script_line:
                more_lines = ""
                more_lines_counter = 1
                while ("<br>" not in more_lines and "</font>" not in more_lines
                        and line_count + more_lines_counter < len(script_lines)):
                    more_lines = strip_tags(more_lines) + script_lines[line_count + more_lines_counter]
                    more_lines_counter += 1
                extracted_line += strip_tags(more_lines)
                extracted_line = extracted_line.replace("\n", " ")
            corpus_file.write(extracted_line.strip() + "\n")
        line_count += 1

corpus_file.close()
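If you’re curious what the tag stripping actually does, here’s a quick standalone check using the same stripper as extract.py. (The transcript line below is one I made up for illustration, not pulled from an actual episode file.)

```python
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

# A made-up transcript line, just for illustration
line = "GARAK: My dear <i>Doctor</i>, you've been reading<br>"
print(strip_tags(line[len("GARAK") + 1:]).strip())
# → My dear Doctor, you've been reading
```

The slice skips past the character name and the colon, and the parser discards the `<i>` and `<br>` tags while keeping their text content.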

Back in the command line, go to the extract-lines/ folder, and run the following command:

$ python3 extract.py GARAK

This will make a text file called GARAK.txt in the extract-lines/ folder that contains every line spoken by Garak.

Do that for every character whose lines you want to extract. You’ll end up with a bunch of .txt files that you can copy into a new project.

Now, make a new folder. I put mine in ~/Software/more_ds9/.

You’ll need to make a Python virtual environment because whoever invented Python hates you. Run the following in your terminal and don’t think too much about it:

$ cd ~/Software/more_ds9/
$ python3 -m venv env
$ source env/bin/activate
$ pip install markovify

Okay I guess I should explain. What you’ve done is created a little mini-Python installation inside your system’s big Python installation, so that you can install packages just for this project without them affecting anything else. To access this, in the terminal you ran $ source env/bin/activate, and if you want to run your Python code later and have it work, you have to do that first every time. When you’re done with it, just type $ deactivate.
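If you’re ever unsure whether the virtual environment is actually active, Python itself can tell you. This is standard library behaviour, nothing specific to this project:

```python
import sys

# Inside an active venv, sys.prefix points at the venv directory,
# while sys.base_prefix still points at the system Python installation.
in_venv = sys.prefix != sys.base_prefix
print("virtual environment active" if in_venv else "using the system Python")
```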

Make a new file in your project directory called markov.py with the following Python code in it:

# Usage example:
# $ python3 markov.py GARAK

import sys
import markovify

with open("corpuses/" + sys.argv[1] + ".txt") as corpus_file:
    corpus_text = corpus_file.read()

# Build the model
text_model = markovify.Text(corpus_text)

# Generate a sentence
print(sys.argv[1] + ": " + str(text_model.make_sentence()))
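Under the hood, what markovify builds is conceptually simple: a table mapping each state (one or more words) to the words that can follow it, which you then walk at random. Here’s a toy, standard-library-only sketch of the idea using a single word as the state. (markovify’s real implementation is more sophisticated, and the one-sentence “corpus” here is just a stand-in for your text files.)

```python
import random
from collections import defaultdict

def build_chain(text):
    # Map each word to the list of words that follow it in the text
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length=8):
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break  # dead end: no word ever followed this one
        out.append(random.choice(followers))
    return " ".join(out)

corpus = "the truth is usually just an excuse for a lack of imagination"
chain = build_chain(corpus)
print(generate(chain, "the"))
# → the truth is usually just an excuse for
```

(With a one-sentence corpus every word has exactly one follower, so the “random” walk just replays the sentence; with thousands of lines of dialogue, words have many possible followers and the output gets interestingly scrambled. markovify does essentially this with two-word states by default, plus bookkeeping to start and end sentences sensibly.)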

Make a new directory called corpuses inside more_ds9 and copy into it all the text files that you generated in your extract-lines project above.

Go to your command line and type the following:

$ python3 markov.py GARAK

It should give you some output like:

GARAK: This time, Intendant, I trust the source, but rest assured I will confirm the rod's authenticity before I say I am.

If you change “GARAK” to any other character whose lines you extracted in the previous project, you will get output generated from that character’s lines. Now you have the tools and data sources to make a Markov chain for any character in any Star Trek series you like!

And if you don’t want to bother with Python and all that, I took this method and built a fedibot that posts “new” Deep Space Nine dialogue using this method once per hour, which you can find here: https://botsin.space/@moreds9

Published by

The Grey Literature

This is the personal blog of Benjamin Gregory Carlisle PhD. Queer; Academic; Queer academic. "I'm the research fairy, here to make your academic problems disappear!"
