Requirements: Linux desktop computer with Python 3
There is a wonderful site that has transcripts for every Star Trek episode ever at chakoteya.net. (Thank you!) This will be the data source that we will be using for this tutorial. And if you appreciate the work that went into transcribing, there’s a donation button on their site.
In order to reduce the amount of strain that I’m putting on their server, I made a local copy of all their transcripts by scraping their site using the following in the command line:
$ wget -r -np http://chakoteya.net/DS9/episodes.htm
This step only has to be done once, and now that the files are saved locally, we don’t have to keep hitting their server with requests for transcripts. This will get you all the transcripts for DS9, but you could also navigate to, say, the page for TNG and do the same there if you were so inclined.
This produces a directory full of numbered HTML files (401.htm to 575.htm, in the case of DS9) and some other files (episodes.htm and robots.txt) that can be safely discarded.
Make a new directory for your work. I keep my projects in ~/Software/, and this one in particular I put in ~/Software/extract-lines/, but you can keep it wherever. Inside extract-lines/, make a folder called scripts/ and fill it with the numbered HTML files you downloaded previously (the script below expects to find them there).
Make a new file called extract.py with the following Python code inside it:
# Provide the character name you wish to extract as an argument to this script
# Must be upper case (e.g. "GARAK" not "Garak")
# For example:
# $ python3 extract.py GARAK
import sys
import os
from html.parser import HTMLParser

# Strips HTML tags, keeping only the text between them
class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

character = sys.argv[1]
corpus_file = open(character + ".txt", "a")

for file_name in os.listdir("scripts/"):
    script_file = open("scripts/" + file_name, "r")
    script_lines = script_file.readlines()
    line_count = 0
    for script_line in script_lines:
        extracted_line = ""
        if script_line.startswith(character):
            # Skip past the character name and the colon
            extracted_line += strip_tags(script_line[len(character) + 1:])
            # If the line of dialogue continues onto the next line(s) of the
            # file, keep reading until we hit a <br> or </font> tag
            if "<br>" not in script_line and "</font>" not in script_line:
                more_lines = ""
                more_lines_counter = 1
                while "<br>" not in more_lines and "</font>" not in more_lines:
                    more_lines = strip_tags(more_lines) + script_lines[line_count + more_lines_counter]
                    more_lines_counter += 1
                extracted_line += strip_tags(more_lines)
            extracted_line = extracted_line.replace("\n", " ")
            corpus_file.write(extracted_line.strip() + "\n")
        line_count += 1
    script_file.close()

corpus_file.close()
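If you're curious what the MLStripper class is actually doing, here's a small self-contained sketch (the sample dialogue line is made up, not taken from an actual transcript file):

```python
from html.parser import HTMLParser

# Same stripper as in extract.py: collect only the text nodes,
# discarding every tag encountered along the way.
class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.convert_charrefs = True
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

# A made-up example of what a transcript line might look like:
sample = "GARAK: <b>Truth</b> is in the eye of the beholder.<br>"
print(strip_tags(sample))
# GARAK: Truth is in the eye of the beholder.
```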
Back in the command line, go to the extract-lines/ folder and run the following command:
$ python3 extract.py GARAK
This will make a text file called GARAK.txt in the extract-lines/ folder that contains every line spoken by Garak.
Do that for every character whose lines you want to extract. You’ll end up with a bunch of .txt files that you can copy into a new project.
Now, make a new folder for the Markov chain project. I put mine in ~/Software/more_ds9/.
You’ll need to make a Python virtual environment because whoever invented Python hates you. Run the following in your terminal and don’t think too much about it:
$ cd ~/Software/more_ds9/
$ python3 -m venv env
$ source env/bin/activate
$ pip install markovify
Okay I guess I should explain. What you’ve done is created a little mini-Python installation inside your system’s big Python installation, so that you can install packages just for this project without them affecting anything else. To access this, in the terminal you ran
$ source env/bin/activate, and if you want to run your Python code later and have it work, you have to do that first every time. When you’re done with it, just type deactivate.
Make a new file in your project directory called markov.py with the following Python code in it:
# Usage example:
# $ python3 markov.py GARAK
import sys
import markovify

character = sys.argv[1]

with open("corpuses/" + character + ".txt") as corpus_file:
    corpus_text = corpus_file.read()

# Build the model
text_model = markovify.Text(corpus_text)

# Generate a sentence
print(character + ": " + str(text_model.make_sentence()))
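markovify does the heavy lifting here, but the underlying idea is simple: record which word follows which in the corpus, then walk those transitions at random. Here's a toy sketch of that idea in plain Python — not markovify's actual implementation, just an illustration (the corpus string and function names are made up for the example):

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that followed it in the text."""
    chain = defaultdict(list)
    words = text.split()
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length=8, seed=0):
    """Walk the chain from a start word, picking successors at random."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        successors = chain.get(out[-1])
        if not successors:  # dead end: no word ever followed this one
            break
        out.append(rng.choice(successors))
    return " ".join(out)

corpus = "my dear doctor they are all true especially the lies"
chain = build_chain(corpus)
print(generate(chain, "my"))
# my dear doctor they are all true especially
```

With a real corpus there are many possible successors for each word, which is where the randomness (and the fun) comes in; markovify also uses multi-word states and sentence boundaries, which this sketch skips.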
Inside more_ds9/, make a new directory called corpuses/ (the script above looks for it there) and copy into it all the text files that you generated in your extract-lines project above.
Go to your command line and type the following:
$ python3 markov.py GARAK
It should give you some output like:
GARAK: This time, Intendant, I trust the source, but rest assured I will confirm the rod's authenticity before I say I am.
If you change “GARAK” to any other character whose lines you extracted in the previous project, you will get an output generated by that character. Now you have the tools and data sources to make a Markov chain for any character in any Star Trek series you like!
And if you don’t want to bother with Python and all that, I took this method and built a fedibot that posts “new” Deep Space Nine dialogue once per hour, which you can find here: https://botsin.space/@moreds9