Update: you can find the next post in this series here.
You probably have a favorite Simpsons character. Maybe you hope to someday block out the sun, Mr. Burns style, maybe you enjoy Homer’s skill in averting meltdowns, or maybe you identify with Lisa’s struggles for acceptance. Through its characters, The Simpsons made a huge impact on a generation, and although the show is still running, my best memories will be of the early seasons.
I recently wanted to do some linguistic analysis of Simpsons episodes. To my surprise, I found that it is impossible to find Simpsons episode scripts that record who is speaking each line. You can find episode capsules on snpp.com, but they feature only quotes from the episodes. Simpsoncrazy has some scripts, but not enough for any real analysis. You can find complete transcripts, but they lack any information on who is speaking when.
Here is a (very confusing) sample of a transcript:
I immediately set out to correct this glaring omission. Rather than sit down and manually label each episode with speaker information, I decided to use the power of the computer to do it for me. To quote Homer,
Don't worry, head. The computer will do our thinking now.
I will follow this post up with a technical one, but for now, you can find all of my code here.
Get the transcripts!
Our first step is to grab the scripts and the transcripts. The scripts have the information we need (lines and who spoke them):
So, we have the scripts and transcripts, right? (It’s amazing how easy it is to do this step if you just decide that you have done it). We are going to use the scripts to train the computer to figure out who is speaking in the transcripts.
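The first preprocessing step is splitting each script line into a speaker and their dialogue. Here is a minimal sketch; it assumes a hypothetical "Speaker: dialogue" format, and the real script files may need considerably more cleanup:

```python
import re

def parse_script_line(raw):
    """Split a raw script line into (speaker, dialogue).

    Assumes the hypothetical format "Speaker: dialogue"; lines that
    don't match (stage directions, etc.) come back with speaker None.
    """
    match = re.match(r"^([A-Za-z .']+):\s*(.+)$", raw)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return None, raw.strip()

print(parse_script_line("Homer: D'oh!"))  # ('Homer', "D'oh!")
```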
Just as it is difficult to decide which of the 5,000 varieties of soap at the supermarket is the correct one, it is difficult for a computer to decide which of 50+ characters is speaking a given line. We are going to figure out which characters talk in a similar way, and group them together, based on our scripts.
We can figure out which characters are linguistically similar to each other, based on our small sample of scripts:
In the above, the larger clusters contain multiple characters, but each is labelled only by its first character; the cluster labelled “Moe”, for instance, groups several characters together.
So, we group them together. As we can see, the main characters are all very distinct, due to their large amount of dialogue.
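The grouping step can be sketched with hierarchical clustering over per-character TF-IDF vectors. The dialogue snippets below are invented stand-ins, and the post does not specify the exact clustering method, so treat this as one plausible approach rather than the original code:

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented stand-in data: one document per character, containing all of
# that character's dialogue joined together.
docs = {
    "homer": "woo hoo mmm beer donuts marge d'oh",
    "marge": "hmm homer the kids homer",
    "moe": "beer the bar my tavern beer",
    "barney": "beer moe another beer please",
}
names = list(docs)
X = TfidfVectorizer().fit_transform(docs[n] for n in names).toarray()

# Average-linkage clustering on cosine distances between the vectors,
# cutting the dendrogram at a distance threshold to form the groups.
Z = linkage(X, method="average", metric="cosine")
clusters = fcluster(Z, t=0.9, criterion="distance")
print(dict(zip(names, clusters)))
```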
Some data exploration
Let’s look at the initial data before we dive in.
We can see how many lines each character has in our scripts:
We can also see what words each character is most likely to say (more common towards the bottom). As we have always suspected, it looks like the show does, in fact, revolve around Homer:
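Both exploration steps boil down to simple counting over the parsed (speaker, dialogue) pairs. A sketch with toy data standing in for the real scripts:

```python
from collections import Counter

# Toy labelled lines standing in for the parsed scripts.
lines = [
    ("homer", "mmm beer"),
    ("homer", "why you little"),
    ("lisa", "i play the saxophone"),
]

# Lines per character.
line_counts = Counter(speaker for speaker, _ in lines)

# Word frequencies per character.
word_counts = {}
for speaker, text in lines:
    word_counts.setdefault(speaker, Counter()).update(text.split())

print(line_counts["homer"])  # 2
print(word_counts["lisa"].most_common(2))
```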
Make our model
We can then make a model to predict who is speaking the lines of text in the transcripts.
We can do this by applying a random forest classifier. This will tell us whether the line “I think the boy's hurt.” is Smithers expressing concern, or Burns delighting in causing pain (although attentive Simpsons fans already know the answer).
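The classification step might look like the sketch below. The training lines here are invented stand-ins for the real script data, and the hyperparameters are illustrative, not the post's actual settings:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Invented stand-in training data; the real model trains on script lines.
train_lines = [
    "excellent release the hounds",
    "smithers who is that young go-getter",
    "sir i think the boy is hurt",
    "right away sir",
]
train_speakers = ["burns", "burns", "smithers", "smithers"]

# Bag-of-words features feeding a random forest classifier.
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(train_lines, train_speakers)

print(model.predict(["i think the boy's hurt"])[0])
```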
We also do some matching to find which lines in the scripts exactly correspond to lines in the transcripts, using the k-nearest neighbors algorithm.
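The matching step can be sketched with scikit-learn's NearestNeighbors over shared TF-IDF features. The lines below are toy examples, and the distance metric is my assumption, not necessarily the post's choice:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

script_lines = ["i think the boy's hurt", "release the hounds", "mmm donuts"]
transcript_lines = ["I think the boy's hurt.", "Mmm... donuts."]

# Vectorize both sets with a shared vocabulary, then find, for each
# transcript line, the closest script line by cosine distance.
vec = TfidfVectorizer().fit(script_lines + transcript_lines)
nn = NearestNeighbors(n_neighbors=1, metric="cosine")
nn.fit(vec.transform(script_lines))

dist, idx = nn.kneighbors(vec.transform(transcript_lines))
# transcript line i matches script line idx[i][0], at distance dist[i][0]
```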
After we do this, we end up with labelled transcripts:
We can also see who is speaking how many lines now:
Not all of the lines have been labelled, as the lines the algorithm is uncertain about are left unlabelled. Homer is also being assigned too many lines, but the proportions are roughly in line with our previous graph. We can see what the commonly said words are:
From some of the commonly said words, we can see that some lines are being misattributed (Lisa rarely says “Homer”, for instance), but the results look reasonable overall.
We can also extract basic summary statistics: there are 102,066 lines in the transcripts and 1,016,692 words, for an average of 9.96 words per line.
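The corpus-level numbers above come from the full transcripts; on a toy list the same computation looks like:

```python
# Recomputing the summary statistics over a (tiny) list of transcript lines.
transcript = [
    "I think the boy's hurt.",
    "Mmm... donuts.",
]

n_lines = len(transcript)
n_words = sum(len(line.split()) for line in transcript)
print(n_lines, n_words, round(n_words / n_lines, 2))  # 2 7 3.5
```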
What does this mean?
This leaves us in a good spot. We have labelled the speakers, and although the labelling isn’t perfect, it is a solid starting point that will enable us to do some additional linguistic analysis later on (which is what I originally wanted to do).
We could potentially make this better in a lot of ways:
- Use subtitles with time information, and combine those with audio files to decide who is speaking. This is a very good option, and would get much higher accuracy.
- Add in more labelled training data, potentially with some self-labelled data.
- Tweak the algorithm.
- Clean up the input data more.
I was looking into implementing the first suggestion, but as Homer once said,
There is a time for many words, and there is also a time for sleep.
Well, wrong Homer, but you get the idea. I hope this was an interesting post, and I will follow it up with another one covering the technical details.