Lab 5 2014

From 6.034 Wiki


Revision as of 22:39, 17 November 2009


This is the last problem set in 6.034! It's due Friday, December 4th at 11:59 PM.

To work on this problem set, you will need to get the code:

  • You can view it at: http://web.mit.edu/6.034/www/labs/lab5/
  • Download it as a ZIP file: http://web.mit.edu/6.034/www/labs/lab5/lab5.zip
  • Or, on Athena, add 6.034 and copy it from /mit/6.034/www/labs/lab5/.
  • You will need to download and install an additional software package called Orange. Do this first, so you can work out any installation problems early. Once you have downloaded and installed it, you should be able to run lab5.py with no changes and see the output of several classifiers on the vampire dataset. If you get errors, email us.

Your answers for the problem set belong in the main file lab5.py, as well as boost.py.

Boosting

You're still trying to use AI to predict the votes of politicians. You decide that the ID-tree classifier was too rigid and uninformative, so now you try using a boosting classifier instead.

To make sure that you interpret the results without letting your political preconceptions get in the way, you dig up some old data to work with: in particular, the data from the 4th House of Representatives, which met from 1796 to 1797. (According to the records on voteview.com, this is when the two-party system first emerged, with the two parties being designated "Federalists" and "Republicans".)

You experiment with that data before going on to the 2007-2008 data, finding that Congressmen in 1796 were much clearer about what they were voting on than their 2008 counterparts.

The framework for a boosting classifier can be found in boost.py. You need to finish coding it, and then use it to learn some classifiers and answer a few questions.

The following resources will be helpful:

  • The documentation for the boosting code, which you can find embedded in boost.py in the documentation strings.
  • The Schapire paper on boosting, or the notes that summarize it.
  • The Lab 4 writeup, if you need to refer to how data_reader.py represents legislators and votes.

A (clever|cheap) trick

The boosting code uses a trick that lets it try only half as many base classifiers.

It turns out that AdaBoost does not really care which side of its base classifier is +1 and which side is -1. If you choose a classifier that is the opposite of the best classifier -- it returns -1 for most points that should be +1, and returns +1 for most points that should be -1, and therefore has a high error rate -- it works the same as if you had chosen the negation of that classifier, which is the best classifier.

If the data reweighting step is implemented correctly, it will produce the same weights given a classifier or its opposite. Also, a classifier with a high error rate will end up with a negative alpha value, so that in the final "vote" of classifiers it will act like its opposite. So the important thing about a classifier isn't that its error rate is low -- it's that the error rate is far from 1/2.

In the boosting code, we take advantage of this. We include only classifiers that output +1 for voting YES on a particular motion, or +1 for voting NO on a particular motion, and as the "best classifier" we choose the classifier whose error rate is farthest from 1/2. If the error rate is high, then the result will act like a classifier that outputs +1 for "voting NO or abstaining", or +1 for "voting YES or abstaining", respectively. This means we don't have to include these classifiers in the base classifier list, speeding up the algorithm by a factor of 2.
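To see concretely why a high-error classifier acts like its opposite, look at AdaBoost's voting-weight formula. The following standalone sketch (not part of boost.py; the function name is ours) shows that a classifier with error rate e and its negation, with error rate 1 - e, receive alphas of equal magnitude and opposite sign:

```python
import math

def vote_weight(error_rate):
    # AdaBoost's alpha: 0.5 * ln((1 - error) / error).
    # Positive when error < 1/2, negative when error > 1/2,
    # and exactly zero when error == 1/2.
    return 0.5 * math.log((1.0 - error_rate) / error_rate)

# A base classifier that is wrong 90% of the time gets a negative alpha...
bad = vote_weight(0.9)
# ...whose magnitude equals the alpha of its opposite, wrong only 10% of the time.
good = vote_weight(0.1)

assert bad < 0 < good
assert abs(bad + good) < 1e-12  # same magnitude, opposite sign
```

So in the final weighted vote, a classifier with a negative alpha behaves exactly like its negation with a positive alpha, which is why only one classifier from each pair needs to appear in the base classifier list.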

Completing the code

Here are the parts that you need to complete:

  • In the BoostClassifier class in boost.py, the update_weights method is undefined. You need to define this method so that it changes the data weights in the way prescribed by the AdaBoost algorithm. There are two ways of implementing this update which happen to be mathematically equivalent.
  • In the BoostClassifier class, the classify method is also undefined. Define it so that you can use a trained BoostClassifier as a classifier, outputting +1 or -1 based on the weighted results of its base classifiers.
  • In lab5.py, the most_misclassified function is undefined. You will need to define it to answer the questions.
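For reference, here is a self-contained sketch of the reweighting and voting math in plain Python. The names and signatures are illustrative only; boost.py defines its own interface, so treat this as a statement of the algorithm, not of the required code. It assumes base classifiers are functions mapping a data point to +1 or -1:

```python
import math

def run_adaboost_round(weights, predictions, labels):
    """One round of AdaBoost reweighting (a sketch; boost.py's actual
    interface differs).  weights, predictions, and labels are parallel
    lists; predictions and labels are +1 or -1."""
    # Weighted error: total weight of the misclassified points.
    error = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Multiplicative update: misclassified points gain weight,
    # correctly classified points lose weight...
    new_weights = [w * math.exp(-alpha * p * y)
                   for w, p, y in zip(weights, predictions, labels)]
    # ...then renormalize so the weights sum to 1.
    total = sum(new_weights)
    return [w / total for w in new_weights], alpha

def boosted_classify(alphas, classifiers, point):
    """Weighted vote of base classifiers: sign of sum(alpha_i * h_i(x))."""
    score = sum(a * h(point) for a, h in zip(alphas, classifiers))
    return 1 if score >= 0 else -1
```

The other, mathematically equivalent form of the update scales each misclassified weight by 1/(2e) and each correct weight by 1/(2(1 - e)); after normalization both forms give identical weights.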

Questions

Answer the questions in lab5.py.

When you are asked how a particular political party would vote on a particular motion, disregard the possibility of abstaining. If your classifier results indicate that the party wouldn't vote NO, consider that an indication that the party would vote YES.

Orange you glad someone else implemented these?

First things first: Have you installed Orange yet?

Now that you've installed Orange, when you run lab5.py, does it complain about Orange, or does it show you the outputs of some classifiers on vampire data?

Getting familiar with Orange

This part is optional: it's about using the Orange GUI to do a little machine learning without doing any programming. We've given you some data files (vampires.tab, H004.tab, etc.) that you can play with. Try making something like the screenshot here, and look at the performance, and look at the actual predictions.

You will find that the classifiers do much better on vampires and identifying political parties based on votes, than they do on FIXME.

Using Orange from Python

We have written out a small example using the vampire dataset. Abstract pieces of it into the functions we provide, and answer the FIXME questions using those pieces.

Boosting with Orange

You may be able to do better on the FIXME dataset by using AdaBoost to combine the various classifiers into a new one. Fill in the FIXME function, so that the boosting part uses the outputs of these classifiers as well.

Errata

We expect a few: check back.