# What Makes Us Happy? Let’s Look at Data to Find Out.

I’ve had a lot of different jobs over the past 4 years, and I’ve had some incredible experiences along the way. Lately, I’ve been struggling with what to do next. Or perhaps more accurately, I’ve been struggling with how to decide what to do next. Decisions that seem obvious in hindsight are tough to come to grips with beforehand, and it’s led me to think about what metric I am trying to maximize. I admit that it’s odd to think of life as a way to increase certain metrics, but aren’t we doing this already in a different way? A lot of people (myself included) will at some point say that all we care about is money. Isn’t that just us saying that money is the metric we want to maximize? Now that I am older and wiser (yeah, right), I find myself increasingly concerned with maximizing my own happiness.

Happiness is this strange concept that nobody quite understands, that nobody quite wants to admit to not having, and that isn’t always consistent (do we care about moment-to-moment happiness, or long-term happiness?). I think Wikipedia says it best with “happiness is a fuzzy concept and can mean different things to different people.” Great. Thanks, Wikipedia. The more I read about happiness research, the more I realized that in order to truly understand what makes me happy, I would have to start capturing information about it. This led me to develop an Android application called Happsee that can be used to track and visualize happiness. The results have been fantastic so far, and I will be sharing them in a later post. This post, though, is about looking at happiness on a higher level.

In the course of creating Happsee, I have been meeting a lot of people in the Boston area who are doing interesting things in the field. One of them is Daniel Hadley, the head of SomerStat, a very cool department of the city of Somerville that tries to quantify various aspects of life in the city and improve them. One of these aspects is happiness, and SomerStat has compiled data on happiness in Somerville going back to 2006. I tried not to let my eyes get too wide when Daniel told me about this data (normal people like normal things; I like data). He was gracious enough to share the anonymized data with me, and I will be looking at it in this post to see if it can help us better understand happiness.

# Open Sourcing Movide, a Student-centric Learning Platform

I haven’t blogged in a while, mostly because I have been trying to figure out what I should do next. One thing that I have been working on lately that I am very passionate about is Movide. Movide is a student-centric learning platform. You might yawn at this point and wonder why Movide matters. It’s a natural reaction, given the crowded learning-tools marketplace. Movide matters, I think, because it is an open source attempt to change the LMS and learning-tool paradigm.

Traditional LMS tools are great at what they do: enabling course content to be translated from the classroom to online. They often serve as content-delivery mechanisms rather than skills-measurement mechanisms. This is extremely useful, and I have learned a lot from content hosted on LMS sites. “Social” LMS tools are also fantastic at what they do: reimagining the LMS experience for a more connected, Facebook-driven era. These often deliver content in social-media-like “streams.” I know both of these characterizations are oversimplifications, but I have limited space.

Neither tool can cover all learners and learning styles, and as I reflected more about how I learn, talked with people about how they learn, and read about research in the field, I realized that perhaps there was space for a tool that approached learning in a different way.

# The Power, and Danger, of Visualizations

I recently posted about visualizing the voting patterns of senators. In that post, I scraped voting data for each senator on every vote in the 113th Congress from the Senate website, and then assigned a code of 0 for a no vote on a particular issue, 1 for a yes vote, 2 for an abstention, and 3 if the senator was not in office at the time of the vote (i.e., the senator was replaced mid-term).
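As a sketch, the coding scheme above can be applied like this (the vote strings and senator names are invented for illustration; the real scraper's output format may differ):

```python
# Map each possible vote to the numeric code described above.
CODES = {"no": 0, "yes": 1, "abstain": 2, "absent": 3}

def encode_votes(raw_votes):
    """Turn {senator: [vote, ...]} into {senator: [code, ...]}."""
    return {senator: [CODES[v] for v in votes]
            for senator, votes in raw_votes.items()}

raw = {
    "Senator A": ["yes", "no", "abstain"],
    "Senator B": ["absent", "absent", "yes"],  # seated mid-term
}
encoded = encode_votes(raw)
print(encoded["Senator A"])  # [1, 0, 2]
print(encoded["Senator B"])  # [3, 3, 1]
```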

I was then able to turn this data into two dimensions and plot it to show how the voting patterns of the senators differed. This led to the plot below:

From the above plot, it appears that the Massachusetts and New Jersey senators are very extreme in their voting patterns. Both Juan Carlos Borrás and Fr. pointed out in the comments that this was due to the coding of the votes. Massachusetts and New Jersey had senators who were replaced mid-term, causing their votes to be coded as a 3. Since nobody else had votes coded as a 3, this made them appear to have very different voting patterns, when, in fact, they simply were not in office.

I had known about the 3-coding initially, but opted to keep the data “as-is.” The more I think about it, the more I realize that this could be used to spin a false narrative. I could easily say “Democrats tend to be very extreme in their voting, just look at John Kerry!” or “Massachusetts is the most radical state in the country!” based on the above chart. Of course, neither of these statements is strictly true, but the chart above, which is based on accurate data, could be used to tell such a story.

If we reconstruct the chart without any senators who were switched mid-term, we can tell a very different story.
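Reconstructing the chart amounts to filtering out any senator whose record contains the not-in-office code before projecting again. A minimal sketch, with invented data:

```python
# Drop any senator whose record contains the "not in office" code (3),
# so mid-term replacements cannot dominate the projection.
votes = {
    "Senator A": [1, 0, 2],
    "Senator B": [3, 3, 1],  # replaced mid-term
}
filtered = {name: v for name, v in votes.items() if 3 not in v}
print(sorted(filtered))  # ['Senator A']
```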

# On the Automated Scoring of Essays and the Lessons Learned Along the Way

We’ve all written essays, primarily while we were in school. The sometimes enjoyable process of researching the topic and composing the paper can take hours and hours of careful work. Given this, people react badly to the notion that their essays may be scored not by a human teacher, but by a machine.

A piece of software coldly judging the quality of our carefully constructed phrases and metaphors based on unknown criteria is more than most writers can bear. But is this what automated essay scoring (AES) is? If not, what is it? In this article, I aim to explore what AES is, the state of the field, some of the lessons I have learned along the way, and where I think it is going.
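To give a flavor of what AES actually does before digging in: most systems turn an essay into numeric features and fit a model against human-assigned scores. The toy sketch below uses a single feature (length) and a least-squares line; real systems use hundreds of features and far stronger models, so treat this purely as an illustration:

```python
# A toy illustration of the core idea behind most AES systems: turn each
# essay into numeric features, then fit a model against human scores.

def features(essay):
    """A few crude surface features of an essay."""
    words = essay.split()
    return [len(words),                                       # length
            len(set(w.lower() for w in words)),               # vocabulary size
            sum(len(w) for w in words) / max(len(words), 1)]  # avg word length

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Fit invented (essay length, human score) pairs, then score a new essay.
slope, intercept = fit_line([120, 250, 400], [2, 3, 5])
predicted_score = slope * 300 + intercept
```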

# How Divided Is the Senate?

I very seldom pay attention to politics directly, because politics have always seemed a bit circular and cyclical to me. Most of the political news that I take in ends up worming its way into the news sources that I do consume, like the excellent longform.org. Even given my limited intake of political news, one trend that I have noticed lately is the increasing number of references to the Senate as “polarized” or “divided.” Here is a link to an interesting series of charts on polarization. Is it possible to quantify this polarization? Can quantifying the polarization enable us to draw interesting conclusions?

As I started to walk down this road, I figured that it would be tough to find the data that I needed. My time in the US Foreign Service showed me just how slow the government can be at effectively publishing and using data. Imagine my surprise when I found that the Senate website has a very convenient listing of all of the votes from the 101st Congress to the 113th (current) Congress. This data tells us, for each vote, whether each senator voted yes, voted no, or abstained.

From the vote data, we can generate plots showing how polarized the Senate is. We will assume that two people are not polarized if they have similar voting patterns. Take a single vote as an example: if Senator Ayotte and Senator Alexander both voted no on it, we would consider them not polarized on that issue, as they share the same opinion. This is well and good, but one bill isn’t really reflective of the voting records of the two senators. If we really want to figure out where they stand, we need to perform the analysis across all votes. I will describe the process further down, but for now, let’s jump to a polarization chart:
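To make “similar voting patterns” concrete, one simple metric (an illustrative choice, not necessarily the exact one used here) is the fraction of shared yes/no votes on which two senators disagree:

```python
def disagreement(votes_a, votes_b):
    """Fraction of shared yes/no votes (coded 1/0) on which two senators differ.

    Abstentions and other codes are ignored; only votes both senators cast count.
    """
    shared = [(a, b) for a, b in zip(votes_a, votes_b)
              if a in (0, 1) and b in (0, 1)]
    if not shared:
        return 0.0
    return sum(a != b for a, b in shared) / len(shared)

# Three shared yes/no votes, one disagreement -> 1/3.
print(disagreement([1, 1, 0, 2], [1, 0, 0, 1]))
```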

The above chart has a dot for each Senator, although only some senators are labelled due to space constraints. The further apart the dots are, the more the views of the two senators contrast. Dots are shaded by political affiliation. How can we generate this chart? Keep reading to find out.
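At a sketch level, the chart comes from treating each senator's coded votes as a point in a high-dimensional space and squashing those points down to two plot coordinates. One standard way to do that is PCA; the pure-Python power-iteration version below is only illustrative (in practice a library routine would do this):

```python
import random

def mat_vec(m, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

def top_eigenvector(m, iters=300):
    """Dominant eigenvector of a symmetric PSD matrix via power iteration."""
    rng = random.Random(0)
    v = [rng.random() + 0.1 for _ in m]
    for _ in range(iters):
        w = mat_vec(m, v)
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v

def pca_2d(data):
    """Project rows of `data` (one senator's vote codes per row) to 2-D."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    x = [[row[j] - means[j] for j in range(d)] for row in data]
    cov = [[sum(x[k][i] * x[k][j] for k in range(n)) / n
            for j in range(d)] for i in range(d)]
    pc1 = top_eigenvector(cov)
    # Deflate the first component, then find the second.
    lam1 = sum(mat_vec(cov, pc1)[i] * pc1[i] for i in range(d))
    deflated = [[cov[i][j] - lam1 * pc1[i] * pc1[j] for j in range(d)]
                for i in range(d)]
    pc2 = top_eigenvector(deflated)
    return [(sum(r[j] * pc1[j] for j in range(d)),
             sum(r[j] * pc2[j] for j in range(d))) for r in x]

# Four made-up senators from two tiny "parties"; each row is coded votes.
coords = pca_2d([[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1], [0, 0, 1, 0]])
```

Each (x, y) pair becomes one dot on the chart; in the made-up example the two "parties" separate along the first axis.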

# Programming Instrumental Music From Scratch

I recently posted about automatically making music. The algorithm that I made pulled interesting sequences of music out of existing songs and remixed them. While this worked reasonably well, it didn’t give me full control over the basics of the music; it wasn’t actually specifying which instruments to use or what notes to play.

Maybe I’m being a control freak, but it would be nice to have complete control over exactly what is being played and how it is being played, rather than making a “remixing engine” (although the remixing engine is cool). It would also kind of fulfill my on-and-off ambition of playing the guitar (I’m really bad at it).

Enter the MIDI format. The MIDI format lets you specify pitch, velocity, and instrument. You can specify different instruments in different tracks, and then combine the tracks to make a song. You can write the song into a file, after which you can convert it to sound (I’ll describe this process a bit more further down). Using the power of MIDI, we can define music from the ground up using a computer, and then play it back.
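To make this concrete, here is a minimal sketch of building a playable MIDI file by hand in pure Python, following the standard MIDI file layout (a header chunk plus one track of delta-timed events). In practice a MIDI library is a better choice; this just shows what the format encodes: an instrument, pitches, velocities, and timing.

```python
import struct

def var_len(n):
    """Encode a delta time as a MIDI variable-length quantity."""
    out = [n & 0x7F]
    n >>= 7
    while n:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    return bytes(reversed(out))

def simple_midi(notes, program=24, ticks=480, velocity=64):
    """Build a format-0 MIDI file that plays each note for one quarter note."""
    events = bytes([0x00, 0xC0, program])              # choose the instrument
    for note in notes:
        events += bytes([0x00, 0x90, note, velocity])  # note on
        events += var_len(ticks) + bytes([0x80, note, 0])  # note off
    events += bytes([0x00, 0xFF, 0x2F, 0x00])          # end-of-track meta event
    header = b"MThd" + struct.pack(">IHHH", 6, 0, 1, ticks)
    track = b"MTrk" + struct.pack(">I", len(events)) + events
    return header + track

# C major arpeggio on program 24 (nylon-string guitar in General MIDI).
data = simple_midi([60, 64, 67, 72])
# open("song.mid", "wb").write(data) would save it for playback.
```

A MIDI player or a software synthesizer (e.g., timidity) can then render the resulting `.mid` bytes to sound.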

Now that we know that something like MIDI exists, we can define our algorithm like this:

• Calibrate track generation by reading in a lot of MIDI tracks
• Generate instrumental tracks and tempo tracks
• Combine the instrumental tracks to make songs
• Convert the songs into sound
• Judge the quality of the sound
• Now that we know which songs are good and which songs are bad, remove the bad songs, generate new songs, and repeat
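The steps above amount to a generate-score-cull loop. The sketch below shows the shape of that loop; `generate_track` and `score_track` stand in for the calibrated generator and the quality judge, whose real implementations live in the linked code:

```python
def evolve_songs(generate_track, score_track, pool_size=10, rounds=5, keep=0.5):
    """Repeatedly generate candidate tracks, keep the best, and refill."""
    songs = [generate_track() for _ in range(pool_size)]
    for _ in range(rounds):
        songs.sort(key=score_track, reverse=True)
        survivors = songs[:max(1, int(len(songs) * keep))]
        # Replace the culled songs with freshly generated ones.
        songs = survivors + [generate_track()
                             for _ in range(pool_size - len(survivors))]
    return max(songs, key=score_track)
```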

One important thing to note is that we calibrate the first step, instrumental track generation, by analyzing a lot of existing MIDI tracks (in fact, we have to). This means we can generate tracks that take on the characteristics of any genre we want. We are also indebted to the human composers and artists who created the music in the first place. This algorithm is meant less to replace them than to let me explore music creation in my own way. All of the code for the algorithm is available here.

I got instrumental tracks from MIDI World and MIDI Archive. A lot of the free MIDI sites use sessions to discourage scraping, and these were the only two I could find without such provisions.

Below are some sample tracks created using the algorithm. If the player below does not show up, you may have to visit my site to see it.

# Evolve Your Own Beats: Automatically Generating Music via Algorithms

Update: you can find the next post in this series here.

I recently went to an excellent music meetup where people spoke about the intersection of music and technology. One speaker in particular talked about how music is now being generated by computer.

Music has always fascinated me. It can make us feel emotions in a way few media can. Sadly, I have always been unable to play an instrument well. Generating music by computer lets me leverage one of my strengths, computer programming (which, contrary to popular belief, can be extremely creative), in order to make music. Although I’m not exactly sure how people are doing it now (explicit parsing rules?), I thought of a way to do it algorithmically (because everything is more fun with algorithms, right?).

I opted to pursue a strategy that “evolves” music out of other pieces of music. I chose this strategy in order to emphasize the process. Seeing a track take shape is very exciting, and you can go back to its history and experience each of the tracks that took part in its creation.

I’m going to broadly outline the keys to my strategy below. You might think that a lot of my points are crazy, so bear with me for a while (at least until I can prove them out later):

• We can easily acquire music that is already categorized (i.e., labelled as classical, techno, electronic, etc.)
• We can teach a computer to categorize new music automatically.
• Teaching a computer to categorize music automatically will give us a musical quality assessment tool (patent pending on the MQAT!)
• Once we have an assessment tool, we can generate music, and then use the assessment tool to see if it is any good
• Profit?
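As a sketch of the “MQAT” idea: train a classifier on music with known genres, then treat its confidence on a generated piece as a quality score. The nearest-centroid classifier below is a deliberately simple stand-in, and the feature vectors are placeholders for whatever you extract from the audio:

```python
def centroid(rows):
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def make_mqat(labelled):
    """labelled: {genre: [feature_vector, ...]} (needs at least two genres).

    Returns a scoring function: positive means the piece sits closer to the
    target genre's centroid than to any other genre's.
    """
    centroids = {g: centroid(rows) for g, rows in labelled.items()}
    def score(features, target_genre):
        d_target = dist(features, centroids[target_genre])
        d_other = min(dist(features, c) for g, c in centroids.items()
                      if g != target_genre)
        return d_other - d_target
    return score

# Invented 2-D features: a piece near the "classical" examples scores positive.
score = make_mqat({"classical": [[0, 0], [0, 2]], "techno": [[10, 10], [10, 12]]})
```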

One important thing to note is that while the music itself is automatically generated, the building blocks of the final song are extracted from existing songs and mixed together. So don’t worry about this replacing humans anytime soon. The key to the algorithm is actually finding readily available, free music, and I would sincerely like to thank last.fm and the wonderful artists who put their music up in their free downloads section. I am not sure how much my algorithm has obfuscated the original sounds of the music, but if anyone recognizes theirs, I would love to hear from them.

Here are some samples of the computer-generated music. Because of the way they are generated, they all have short riffs that repeat. I want to improve this behavior in the future, but I will let you judge what is here for yourself. Caution: listening to these may give you a headache. If the player below does not show up, you may have to visit my site to see it.

# Making Infographics Using R and Inkscape

I have been making charts with R for almost as long as I have been using R, and with good reason: R is an amazing tool for filtering and visualizing data. With R, and particularly if we use the excellent ggplot2 library, we can go from raw data to compelling visualization in minutes.

But what if we want to give our visualizations an extra kick? What if we want to do some manual retouching? I had long resisted this, thinking that conveying the data was the major concern, and that it was up to the viewer to parse it how they saw fit. As visualizations become more and more important, it has become evident to me that merely conveying the data is not enough; these days, a visualization must also be visually attractive.

With this realization, I started to research how to make infographics and visualizations. This quickly leads to tools like d3.js, which, while inarguably useful, are also fairly difficult to use.

I then came upon the concept of retouching charts generated in R with a tool like Adobe Illustrator or Inkscape. Inkscape is less full-featured, but it is free, which is very compelling. Since I use Linux, where installing Inkscape is simple, I decided to go with it.

This post will take us from a raw chart exported from R to a finished infographic. The final graphic is below:

# Do the Simpsons Characters Like Each Other?

One day, while I was walking around Cambridge, I had a random thought — how do the characters on the Simpsons feel about each other? It doesn’t take long to figure out how Homer feels about Flanders (hint: he doesn’t always like him), or how Burns feels about everyone, but how does Marge feel about Bart? How does Flanders feel about Homer? I then realized that I work with algorithms — maybe I would be able to devise one to answer this question. After all, I did something similar with the Wikileaks cables.

This idle thought led me down a very deep rabbit hole. The most glaring problem was that no full scripts of the Simpsons exist; there are only transcripts of each episode, with no information on who is speaking each line.

I first tried using natural language processing techniques to determine who was speaking each line. This worked reasonably well, but I felt that it was still missing something. I then directly analyzed the audio from the episodes to figure out a “voice fingerprint” for each character, which I used to label the lines. This was better than just looking at the text of the lines. I wanted to combine these techniques, but ran out of time; combining them at some later date would be a fairly easy way to increase accuracy.
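As an illustrative sketch of the voice-fingerprint idea (not the exact features used): summarize each character's known clips with a few crude numbers, then label a new clip by whichever fingerprint it lands closest to. Real systems would use richer features such as MFCCs; raw energy and zero-crossing rate stand in here:

```python
def clip_features(samples):
    """Two crude features of an audio clip (a list of raw samples)."""
    energy = sum(s * s for s in samples) / len(samples)
    zero_crossings = sum((a < 0) != (b < 0)
                         for a, b in zip(samples, samples[1:])) / len(samples)
    return (energy, zero_crossings)

def fingerprint(clips):
    """Average the features of a character's known clips."""
    feats = [clip_features(c) for c in clips]
    return tuple(sum(f[i] for f in feats) / len(feats) for i in range(2))

def identify(samples, fingerprints):
    """Label a clip with the character whose fingerprint is closest."""
    f = clip_features(samples)
    return min(fingerprints,
               key=lambda ch: sum((a - b) ** 2
                                  for a, b in zip(f, fingerprints[ch])))

# Fake waveforms: a "low-pitched" and a "high-pitched" character.
voices = {"homer": fingerprint([[1, 1, -1, -1] * 10]),
          "lisa": fingerprint([[1, -1] * 20])}
print(identify([1, 1, 1, -1, -1, -1] * 5, voices))  # homer
```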

From the labelled lines, we can determine how much each of the characters likes the rest. If you want to skip ahead, the heatmap of how much the characters like each other is below. It shows how much each character in the row likes each character in the column. Some characters may feel differently about each other (for example, check out Krusty and Lisa). Red indicates dislike, and green indicates like.
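One simple way to get from labelled lines to a like/dislike score (a sketch with an invented mini-lexicon, not the exact method): average a crude sentiment value over the lines in which one character mentions another.

```python
# Tiny invented lexicon; a real system would use a full sentiment model.
POSITIVE = {"love", "great", "best", "friend"}
NEGATIVE = {"hate", "stupid", "worst", "annoying"}

def sentiment(line):
    words = set(line.lower().replace(",", "").split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

def like_matrix(lines, characters):
    """lines: list of (speaker, text). Returns {(speaker, target): avg score}."""
    scores = {}
    for speaker in characters:
        for target in characters:
            if speaker == target:
                continue
            relevant = [sentiment(text) for s, text in lines
                        if s == speaker and target.lower() in text.lower()]
            if relevant:
                scores[(speaker, target)] = sum(relevant) / len(relevant)
    return scores

lines = [("homer", "I hate Flanders"), ("marge", "Bart is a great kid")]
print(like_matrix(lines, ["homer", "marge", "bart", "flanders"]))
```

Each (speaker, target) score then becomes one cell of the heatmap.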

# Using the Power of Sound to Figure Out Which Simpsons Character Is Speaking

Update: you can find the next post in this series here.

In a previous post, I looked at transcripts of Simpsons episodes and tried to figure out which character was speaking which line.

This worked decently, but it wasn’t great. It gave us memorable scenes like this one:

And this one:

And some not so memorable scenes:

Trying to identify who is speaking only by looking at the text is a bit like trying to walk in a straight line with your eyes closed. There is a lot of information that you end up missing.

What if I told you that one of your friends asked me, “Hey, how’s it going?”, and I asked you to figure out which friend it was? Even if you have known your friends for years and years, the text alone won’t help you figure it out.

Enter the amazing sound wave. If I played you a sound clip of your friend saying the same phrase, you would almost instantly know who said it. Audio has a lot of information in this context that text cannot convey, and if we want to accurately identify our Simpsons characters, we need to use it.

As we progress, keep in mind that the code for this is available here, but this is the non-technical explanation. I will make a full technical post once I evaluate the various methods.