Lab 4 2014

From 6.034 Wiki


Revision as of 05:46, 31 October 2008


This problem set is due Friday, November 7th at 11:59 PM. If you have questions about it, ask the list 6.034-tas@mit.edu.

To work on this problem set, you will need to get the code:

Classifying Congress

You've been hired to do political modeling as part of the newly-elected President's transition team. He seems to surround himself with smart people that way.

He takes a moment out of his busy day to explain what you need to do. "I need to be able to tell which of my plans are going to be supported by Congress," he explains. "Do you think we can get a model of Democrats and Republicans in Congress, and which votes separate them the most?"

"Yes we can," you answer.

The data

You acquire the data on how everyone in the previous Senate and House of Representatives voted on every issue. (All this data is available in machine-readable form on voteview.com. We've included it in the lab directory, in the files beginning with H110 and S110.)

data_reader.py contains functions for reading data in this format.

read_congress_data("FILENAME.ord") reads a specially-formatted file that gives information about each Congressperson and the votes they cast. It returns a list of dictionaries, one for each member of Congress, including the following items:

  • 'name': The name of the Congressperson.
  • 'state': The state they represent.
  • 'party': The party that they were elected under.
  • 'votes': The votes that they cast, as a list of numbers. 1 represents a "yea" vote, -1 represents "nay", and 0 represents either that they abstained, were absent, or were not a member of Congress at the time.
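As a rough illustration of this record format, the sketch below builds a few records by hand and summarizes them. The names, states, and votes are made up for the example, and count_by_party and yea_count are hypothetical helpers, not functions from data_reader.py.

```python
# Hypothetical records in the shape described above. The values are invented
# for illustration and are not taken from the real H110/S110 data files.
people = [
    {'name': 'Example A', 'state': 'MA', 'party': 'Democrat', 'votes': [1, -1, 0, 1]},
    {'name': 'Example B', 'state': 'TX', 'party': 'Republican', 'votes': [-1, 1, 1, 0]},
    {'name': 'Example C', 'state': 'VT', 'party': 'Independent', 'votes': [1, -1, 1, 1]},
]

def count_by_party(people):
    """Return a dictionary mapping each party name to its number of members."""
    counts = {}
    for person in people:
        counts[person['party']] = counts.get(person['party'], 0) + 1
    return counts

def yea_count(person):
    """Count the votes recorded as 1 ("yea") for one member."""
    return sum(1 for v in person['votes'] if v == 1)

party_counts = count_by_party(people)
yeas = [yea_count(p) for p in people]
```

The same helpers should work on a list such as senate_people, assuming it follows the format listed above.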

To make sense of the votes, you will also need information about what they were voting on. This is provided by read_vote_data("FILENAME.csv"), which returns a list of votes in the same order that they appear in the Congresspeople's entries. Each vote is represented as a dictionary of information, which you can convert into a readable string by running vote_info(vote).

The lab file reads in the provided data, storing them in the variables senate_people, senate_votes, house_people, and house_votes.

Nearest neighbors

You decide to start by making a nearest-neighbors classifier that can tell Democrats apart from Republicans in the Senate.

We've provided a nearest_neighbors function that classifies data based on training data and a distance function. In particular, this is a third-order function:

  • First you call nearest_neighbors(distance, k), with distance being the distance function you wish to use and k being the number of neighbors to check. This returns a classifier factory.
  • A classifier factory is a function that makes classifiers. You call it with some training data as an argument, and it returns a classifier.
  • Finally, you call the classifier with a data point (here, a Congressperson) and it returns the classification as a string.
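The three stages above can be sketched as nested functions. This is only an illustration of the calling pattern, not the provided implementation: nearest_neighbors_sketch and count_differences are hypothetical names, and the sketch assumes that the distance function receives two lists of votes and that ties among neighbors do not need special handling.

```python
from collections import Counter

def count_differences(a, b):
    """Toy distance for the example: number of positions where two lists differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def nearest_neighbors_sketch(distance, k):
    """Stage 1: fix the distance function and k, and return a classifier factory."""
    def factory(training_data):
        """Stage 2: fix the training data, and return a classifier."""
        def classifier(point):
            """Stage 3: label one data point by majority vote of its k nearest neighbors."""
            ranked = sorted(training_data,
                            key=lambda other: distance(point['votes'], other['votes']))
            nearest_parties = [other['party'] for other in ranked[:k]]
            return Counter(nearest_parties).most_common(1)[0][0]
        return classifier
    return factory

# Made-up training data, small enough to check by hand.
training = [
    {'party': 'Democrat', 'votes': [1, 1, 1]},
    {'party': 'Democrat', 'votes': [1, 1, -1]},
    {'party': 'Republican', 'votes': [-1, -1, -1]},
]
factory = nearest_neighbors_sketch(count_differences, 1)
classifier = factory(training)
label = classifier({'votes': [-1, -1, 1]})
```

Each call fixes one more piece of information, which is what lets the same factory be retrained on different data sets.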

Much of this is handled by the evaluate(factory, group1, group2) function, which you can use to test the effectiveness of a classification strategy. You give it a classifier factory (as defined above) and two sets of data. It will train a classifier on one data set and test the results against the other, and then it will switch them and test again.
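The train-then-swap procedure described here might look roughly like the sketch below. It is an assumption about the overall shape, not the provided evaluate function: evaluate_sketch and majority_factory are hypothetical, and the real function's return value and verbose output may differ.

```python
def evaluate_sketch(factory, group1, group2):
    """Train on one group, test on the other, then swap.
    Returns the total number of test points classified correctly."""
    correct = 0
    for train, test in ((group1, group2), (group2, group1)):
        classifier = factory(train)
        correct += sum(1 for person in test
                       if classifier(person) == person['party'])
    return correct

def majority_factory(training_data):
    """Baseline factory for the example: always predict the most common training party."""
    parties = [p['party'] for p in training_data]
    most_common = max(set(parties), key=parties.count)
    return lambda person: most_common

# Made-up groups with a clear majority in each, so the baseline is deterministic.
group1 = [{'party': 'Democrat'}, {'party': 'Democrat'}, {'party': 'Republican'}]
group2 = [{'party': 'Republican'}, {'party': 'Republican'}, {'party': 'Democrat'}]
score = evaluate_sketch(majority_factory, group1, group2)
```

Testing in both directions means every data point is used as a test case exactly once.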

Given a list of data such as senate_people, you can divide it arbitrarily into two groups using the crosscheck_groups(data) function.

One way to measure the "distance" between Congresspeople is with the Hamming distance: the number of entries that differ. This function is provided as hamming_distance.
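For reference, a Hamming distance over two equal-length vote lists can be written in a few lines. This is a minimal sketch under the assumption that the function receives two plain lists; the provided hamming_distance may be written differently.

```python
def hamming_distance_sketch(list1, list2):
    """Count the positions at which two equal-length lists differ."""
    if len(list1) != len(list2):
        raise ValueError("lists must have the same length")
    return sum(1 for a, b in zip(list1, list2) if a != b)

# The two made-up vote lists differ in the second and fourth positions.
example = hamming_distance_sketch([1, -1, 0, 1], [1, 1, 0, -1])
```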

An example of putting this all together is provided in the lab code:

senate_group1, senate_group2 = crosscheck_groups(senate_people)
evaluate(nearest_neighbors(hamming_distance, 1), senate_group1, senate_group2, verbose=1)

Examine the results of this evaluation. In addition to the problems caused by independents, it classifies Senator Johnson from South Dakota as a Republican instead of a Democrat, mainly because he missed a lot of votes while he was being treated for cancer. This is a problem with the distance function: when one Senator votes yes and another is absent, that should count as less of a "disagreement" than when one votes yes and the other votes no, but Hamming distance counts both cases the same.

You should fix this. Euclidean distance is a reasonable measure for the distance between lists of discrete numeric features. Recall that the formula for Euclidean distance is:

[(x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2] ^ (1/2)
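To see why this formula treats an absence as a smaller disagreement than an opposing vote, it can be worked through on three made-up voting records. The function name euclidean_sketch and the records are illustrative assumptions, not part of the lab code.

```python
import math

def euclidean_sketch(xs, ys):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))

yea_voter = [1, 1, 1, 1]
absent_voter = [0, 0, 0, 0]
nay_voter = [-1, -1, -1, -1]

# Each yea-vs-absent entry contributes (1 - 0)^2 = 1.
d_absent = euclidean_sketch(yea_voter, absent_voter)
# Each yea-vs-nay entry contributes (1 - (-1))^2 = 4.
d_nay = euclidean_sketch(yea_voter, nay_voter)
```

Here d_absent is 2.0 and d_nay is 4.0, whereas the Hamming distance would be 4 in both cases.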

  • Make a distance function called euclidean_distance that treats the votes as high-dimensional vectors, and returns the Euclidean distance between them.

When you evaluate using euclidean_distance, you should get better results, except that some people are being classified as Independents. Given that there are only 2 Independents in the Senate, you want to avoid classifying someone as an Independent just because they vote similarly to one of them.

  • Make a simple change to the parameters of nearest_neighbors that accomplishes this, and call the classifier factory it outputs my_classifier.

ID Trees

So far you've classified Democrats and Republicans, but you haven't created a model of which votes distinguish them. You want to make a classifier that explains the distinctions it makes, so you decide to use an ID-tree classifier.

idtree_maker(votes, disorder_metric) is a third-order function similar to nearest_neighbors. You initialize it by giving it a list of vote information (such as senate_votes or house_votes) and a function for calculating the disorder of two classes. It returns a classifier factory that will produce instances of the CongressIDTree class, defined in classify.py, to distinguish legislators based on their votes.

The possible decision boundaries used by CongressIDTree are, for each vote:

  • Did this legislator vote YES on this vote, or not?
  • Did this legislator vote NO on this vote, or not?

These are different because it is possible for a legislator to abstain or be absent.
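The difference between the two kinds of test shows up as soon as someone is absent. The sketch below uses hypothetical helpers (voted_yes, voted_no, split_by) on made-up records; it only illustrates the idea and is not how CongressIDTree is implemented.

```python
def voted_yes(person, vote_index):
    """True only for a recorded "yea" (1); absences (0) do not match."""
    return person['votes'][vote_index] == 1

def voted_no(person, vote_index):
    """True only for a recorded "nay" (-1); absences (0) do not match."""
    return person['votes'][vote_index] == -1

def split_by(test, people, vote_index):
    """Return the party labels that match the test and those that do not."""
    matching = [p['party'] for p in people if test(p, vote_index)]
    non_matching = [p['party'] for p in people if not test(p, vote_index)]
    return matching, non_matching

# Made-up records: the third member was absent for vote 0.
people = [
    {'party': 'Democrat', 'votes': [1, -1]},
    {'party': 'Republican', 'votes': [-1, 1]},
    {'party': 'Independent', 'votes': [0, 1]},
]
yes_split = split_by(voted_yes, people, 0)
no_split = split_by(voted_no, people, 0)
```

The absent member lands on the non-matching side of both tests, so the two splits group the legislators differently.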

You can also use CongressIDTree directly to make an ID tree over the entire data set.

If you print a CongressIDTree, then you get a text representation of the tree. Each level of the ID tree shows the minimum disorder it found, the criterion that gives this minimum disorder, the decision it makes for legislators who match the criterion (marked with a +), and the decision for legislators who don't (marked with a -). The decisions are either a party name or another ID tree. An example is shown in the section below.

An ID tree for the entire Senate

You start by making an ID tree for the entire Senate. This doesn't leave you anything to test it on, but it will show you the votes that distinguish Republicans from Democrats the most quickly overall. You run this (which you can uncomment in your lab file):

print CongressIDTree(senate_people, senate_votes, homogeneous_disorder)

The ID tree you get here is:

Disorder: -49
Yes on S.Con.Res. 21: Kyl Amdt. No. 583; To reform the death tax by setting the
exemption at $5 million per estate, indexed for inflation, and the top death
tax rate at no more than 35% beginning in 2010; to avoid subjecting an
estimated 119,200 families, family businesses, and family farms to the death
tax each and every year; to promote continued economic growth and job creation;
and to make the enhanced teacher deduction permanent.:
+ Republican
- Disorder: -44
  Yes on H.R. 1585: Feingold Amdt. No. 2924; To safely redeploy United States
  troops from Iraq.:
  + Democrat
  - Disorder: -3
    No on H.R. 1495: Coburn Amdt. No. 1089; To prioritize Federal spending to
    ensure the needs of Louisiana residents who lost their homes as a result of
    Hurricane Katrina and Rita are met before spending money to design or
    construct a nonessential visitors center.:
    + Democrat
    - Disorder: -2
      Yes on S.Res. 19: S. Res. 19; A resolution honoring President Gerald
      Rudolph Ford.:
      + Disorder: -4
        Yes on H.R. 6: Motion to Waive C.B.A. re: Inhofe Amdt. No. 1666; To
        ensure agricultural equity with respect to the renewable fuels standard.:
        + Democrat
        - Independent
      - Republican

Some things that you can observe from these results are:

  • Senators like to write bills with very long-winded titles that make political points.
  • The key issue that most clearly divided Democrats and Republicans was the issue that Democrats call the "estate tax" and Republicans call the "death tax", with 49 Republicans voting to reform it.
  • The next key issue involved 44 Democrats voting to redeploy troops from Iraq.
  • The issues below that serve only to peel off homogeneous groups of 2 to 4 people.
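The negative disorder values in the tree are consistent with a metric that rewards splits for the number of legislators who end up in a single-party group. The sketch below is a guess at such a metric, inferred only from the printed numbers; the provided homogeneous_disorder may be defined differently.

```python
def homogeneous_disorder_sketch(yes, no):
    """Negative count of the members who land on a single-party side of the split.
    Lower (more negative) is better, so large pure groups are preferred."""
    score = 0
    for side in (yes, no):
        if side and len(set(side)) == 1:
            score += len(side)
    return -score

# Made-up split: 49 Republicans on one side, a mixed group on the other.
top_split = homogeneous_disorder_sketch(
    ['Republican'] * 49,
    ['Democrat'] * 49 + ['Independent'] * 2,
)
```

Under this assumption, a metric of this kind never prefers an even, informative split over one that peels off a small pure group, which is one way to read the deep tree above.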

Implementing a better disorder metric

You should be able to reduce the depth and complexity of the tree by changing the disorder metric from the one that looks for the largest homogeneous group to the information-theoretic metric described in lecture.

You can find this formula on page 429 of the reading.

  • Write the information_disorder(group1, group2) function to replace homogeneous_disorder. This function takes in the lists of classifications that fall on each side of the decision boundary, and returns the information-theoretical disorder.

Example:

information_disorder(["Democrat", "Democrat", "Democrat"], ["Republican", "Republican"])
  => 0.0
information_disorder(["Democrat", "Republican"], ["Republican", "Democrat"])
  => 1.0
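One way to get these values is the weighted average entropy of the two branches, using base-2 logarithms. The sketch below assumes that this is the formula the reading gives, which is not reproduced here; information_disorder_sketch is a hypothetical name, and it agrees with both examples above.

```python
import math

def information_disorder_sketch(group1, group2):
    """Average disorder of a split: for each branch, its share of all samples
    times the entropy (base 2) of the class labels inside that branch."""
    total = len(group1) + len(group2)
    if total == 0:
        return 0.0
    disorder = 0.0
    for branch in (group1, group2):
        if not branch:
            continue  # an empty branch contributes nothing
        weight = len(branch) / float(total)
        for label in set(branch):
            fraction = branch.count(label) / float(len(branch))
            disorder -= weight * fraction * math.log(fraction, 2)
    return disorder

pure = information_disorder_sketch(["Democrat"] * 3, ["Republican"] * 2)
mixed = information_disorder_sketch(["Democrat", "Republican"],
                                    ["Republican", "Democrat"])
```

Because the result is computed in floating point, comparisons against the expected values are safer with a small tolerance than with exact equality.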

Once this is written, you can try making a new CongressIDTree with it.

Evaluating over the House of Representatives

Now, you decide to evaluate how well ID trees do in the wild, weird world of the House of Representatives.

You can try running an ID tree on the entire House and all of its votes. It's disappointing. The 110th House began with a vote on the rules of order, where everyone present voted along straight party lines. It's not a very informative result to observe that Democrats think Democrats should make the rules and Republicans think Republicans should make the rules.

Anyway, since your task was to make a tool for classifying the newly-elected Congress, you'd like it to work after a relatively small number of votes. We've provided a function, limited_house_classifier, which evaluates an ID tree classifier that uses only the most recent N votes in the House of Representatives. You just need to find a good value of N.

  • Using limited_house_classifier, find a good number N of votes to take into account, so that the resulting ID trees classify at least 430 Congresspeople correctly.

The total number of Congresspeople in the evaluation may change, as people who didn't vote in the last N votes (perhaps because they're not in office anymore) aren't included. Also, beware of overfitting -- putting in more votes is not always an improvement.
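The effect of restricting to the most recent N votes, including why the head count changes, can be sketched as follows. keep_last_n_votes is a hypothetical helper written for this illustration; it is not how limited_house_classifier is implemented, and it assumes that "didn't vote" means every entry in the window is 0.

```python
def keep_last_n_votes(people, n):
    """Return copies of the records restricted to their final n votes,
    dropping members with no recorded vote (all zeros) in that window."""
    if n < 1:
        raise ValueError("n must be at least 1")
    limited = []
    for person in people:
        recent = person['votes'][-n:]
        if any(v != 0 for v in recent):
            trimmed = dict(person)      # shallow copy, so the original is untouched
            trimmed['votes'] = recent
            limited.append(trimmed)
    return limited

# Made-up records: the first member cast no votes in the last two roll calls.
people = [
    {'name': 'Example A', 'party': 'Democrat', 'votes': [1, 1, 0, 0]},
    {'name': 'Example B', 'party': 'Republican', 'votes': [1, -1, 1, -1]},
]
recent_two = keep_last_n_votes(people, 2)
recent_four = keep_last_n_votes(people, 4)
```

With N = 2 only one member remains, while with N = 4 both do, which is why the total in the evaluation depends on N.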

Survey

Please answer these questions at the bottom of your lab file:

  • How many hours did this problem set take?
  • Which parts of this problem set, if any, did you find interesting?
  • Which parts of this problem set, if any, did you find boring or tedious?

(We'd ask which parts you find confusing, but if you're confused you should really ask a TA.)

Errata

If you find what you think is an error in the problem set, tell 6.034-tas@mit.edu about it.
