Problem Set 4b

This problem set is due Tuesday, November 14th at 11:59 PM. If you have questions about it, ask the 6.034-tas@mit.edu mailing list. Your response will probably come from Shannon.

To work on this problem set, you will need to get the code, much like you did for earlier problem sets.

Your answers for the problem set belong in the main file ps4b.scm.

Things to remember
  • Avoid using DrScheme's graphical comment boxes. If you do use them, take them out or use Save definitions as text... so that you submit your Scheme file in plain text.
  • If you are going to submit your pset late, see the problem set grading policy.

Neural Nets

In this problem set, you will be experimenting with neural nets. Although you'll only have to do a small amount of coding, take your time and make sure you really understand what is happening when you run the nets.

Learning XOR

The first function the neural net will be learning is XOR. XOR is a binary function that returns a 1 when exactly one of its two inputs is 1. The XOR table looks like this:

A B XOR(A, B)
1 1    0
1 0    1
0 1    1
0 0    0

To do this, we are using a net with 2 input neurons, 2 hidden (internal) neurons, and 1 output neuron. Take a look in nnet-train.scm to find the procedures you will use to train the net on XOR.


Hamming Distance

Implement (hamming-distance x y). Given a pair of feature vectors, it should return the number of features that differ between the two. For example,

(hamming-distance '(1 1 1) '(1 1 5))

should return 1 because one feature is different.
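
For reference, here is one way (hamming-distance x y) might look, a minimal sketch assuming the feature vectors are plain lists of equal length:

(define (hamming-distance x y)
  ;; count the positions at which the two vectors disagree
  (cond ((null? x) 0)
        ((equal? (car x) (car y)) (hamming-distance (cdr x) (cdr y)))
        (else (+ 1 (hamming-distance (cdr x) (cdr y))))))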

Euclidean Distance

Hamming distance is a reasonable distance metric for discrete features, but it does not perform as well with continuous data. Implement (euclidean-distance x y). Given a pair of feature vectors, it should return the Euclidean distance between them. Recall that the formula for Euclidean distance is:

[(x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2] ^ (1/2)
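
A minimal sketch of one way to compute this, assuming the feature vectors are lists of numbers of equal length:

(define (euclidean-distance x y)
  ;; square root of the sum of squared componentwise differences
  (sqrt (apply + (map (lambda (xi yi) (* (- xi yi) (- xi yi))) x y))))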

If you want to test your Euclidean distance metric on congressional data, you'll have to first convert the feature vectors into numerical ones using this method:

(congress-to-metric data) - Converts an entire dataset from yes/no/maybe votes into numerical vectors

Angular Distance

An alternative to using plain distance is to find the angle between the vectors. This accounts for similar feature sets that differ mainly in magnitude. Implement (cos-theta x y), which returns the cosine of the angle between the two vectors. Recall that this can be calculated in the following way:

(x • y) / [|x||y|]

(x dot y over magnitude of x times magnitude of y)
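
A minimal sketch of one way to write (cos-theta x y), assuming x and y are nonzero numeric lists of equal length; the dot-product and magnitude helpers are just illustrative names defined here, not part of the provided code:

(define (dot-product x y)
  ;; sum of componentwise products
  (apply + (map * x y)))

(define (magnitude v)
  (sqrt (dot-product v v)))

(define (cos-theta x y)
  (/ (dot-product x y)
     (* (magnitude x) (magnitude y))))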

Implementing Nearest-Neighbors

Now that we have working distance metrics, implement (make-nn-classifier distance-metric training-examples). This sets up a nearest-neighbor classifier, which can then be used to classify feature sets in the following manner:

(define nn (make-nn-classifier distance-metric training-examples))
(nn feature-vector)

When given a feature vector, the nearest-neighbor classifier should return the class of the neighbor with the smallest distance, as measured by distance-metric.
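
Here is a rough sketch of one way this could look. It assumes each training example is a pair of the form (feature-vector . class); the actual representation used by the problem-set code may differ, so treat this as an illustration of the idea rather than a drop-in answer:

(define (make-nn-classifier distance-metric training-examples)
  (lambda (feature-vector)
    ;; walk the examples, remembering the closest one seen so far
    (let loop ((rest (cdr training-examples))
               (best (car training-examples))
               (best-dist (distance-metric feature-vector (car (car training-examples)))))
      (if (null? rest)
          (cdr best)    ; the class of the nearest example
          (let ((dist (distance-metric feature-vector (car (car rest)))))
            (if (< dist best-dist)
                (loop (cdr rest) (car rest) dist)
                (loop (cdr rest) best best-dist)))))))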

K-Nearest-Neighbors

To generalize, implement (make-k-nn-classifier k distance-metric training-examples). This sets up a k-nearest-neighbors classifier, which can then be used to classify feature sets in the following manner:

(define k-nn (make-k-nn-classifier k distance-metric training-examples))
(k-nn feature-vector)

When given a feature vector, the classifier should return the class most represented among the k nearest neighbors, as determined by distance-metric.
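
A rough sketch under the same assumed (feature-vector . class) representation, written in plain Scheme so it does not depend on a library sort procedure: it keeps the examples ordered by distance to the query with an insertion sort, then takes a majority vote over the first k:

(define (make-k-nn-classifier k distance-metric training-examples)
  ;; insert a (distance . class) entry into a list kept sorted by distance
  (define (insert-entry entry sorted)
    (cond ((null? sorted) (list entry))
          ((< (car entry) (car (car sorted))) (cons entry sorted))
          (else (cons (car sorted) (insert-entry entry (cdr sorted))))))
  ;; count how many times class appears in a list of classes
  (define (count-class class classes)
    (cond ((null? classes) 0)
          ((equal? class (car classes)) (+ 1 (count-class class (cdr classes))))
          (else (count-class class (cdr classes)))))
  (lambda (feature-vector)
    (let* (;; all (distance . class) entries, nearest first
           (sorted
            (let loop ((examples training-examples) (acc '()))
              (if (null? examples)
                  acc
                  (loop (cdr examples)
                        (insert-entry
                         (cons (distance-metric feature-vector (car (car examples)))
                               (cdr (car examples)))
                         acc)))))
           ;; the classes of the k nearest examples
           (neighbor-classes
            (let take ((entries sorted) (n k))
              (if (or (= n 0) (null? entries))
                  '()
                  (cons (cdr (car entries)) (take (cdr entries) (- n 1)))))))
      ;; return the class that occurs most often among the k nearest neighbors
      (let vote ((remaining neighbor-classes) (best #f) (best-count 0))
        (if (null? remaining)
            best
            (let ((count (count-class (car remaining) neighbor-classes)))
              (if (> count best-count)
                  (vote (cdr remaining) (car remaining) count)
                  (vote (cdr remaining) best best-count))))))))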

When you have written code for nearest-neighbors and k-nearest-neighbors, you can use this procedure to test your classifiers on large datasets:

(validate-classifier classifier examples)

This tests a classifier against a set of labeled data, indicating the number of correct classifications.

Identification Trees

In this part of the problem set, you will be implementing code that works as part of an identification-tree classification scheme. When creating an identification tree, it is desirable to order the attribute tests so that the tree is as small as possible. This means ordering the tests so that disorder decreases as quickly as possible. To do this, we use the heuristic of choosing, as the next attribute to test, the one that places the most elements in homogeneous groups.

Basic Disorder Metric

Implement (basic-disorder attribute-to-split classes examples), which returns the negative of the number of elements in homogeneous groups (so that the more elements end up in homogeneous groups, the lower the disorder metric becomes).
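
Here is a rough sketch of the counting logic, under some assumptions about representation: attribute-to-split is taken to be a procedure mapping an example to its value for that attribute, and example-class is a hypothetical accessor returning an example's class label (the real problem-set accessors may be named differently). This particular approach does not end up needing the classes argument:

(define (basic-disorder attribute-to-split classes examples)
  ;; the examples whose split attribute has the given value
  (define (examples-with-value value exs)
    (cond ((null? exs) '())
          ((equal? (attribute-to-split (car exs)) value)
           (cons (car exs) (examples-with-value value (cdr exs))))
          (else (examples-with-value value (cdr exs)))))
  ;; the distinct values the split attribute takes over the examples
  (define (attribute-values exs seen)
    (cond ((null? exs) seen)
          ((member (attribute-to-split (car exs)) seen)
           (attribute-values (cdr exs) seen))
          (else (attribute-values (cdr exs)
                                  (cons (attribute-to-split (car exs)) seen)))))
  ;; a branch is homogeneous if every example in it has the same class
  ;; (example-class is a hypothetical accessor for an example's class label)
  (define (homogeneous? branch)
    (or (null? branch)
        (null? (cdr branch))
        (and (equal? (example-class (car branch))
                     (example-class (car (cdr branch))))
             (homogeneous? (cdr branch)))))
  (let loop ((values (attribute-values examples '())) (count 0))
    (if (null? values)
        (- count)    ; negate so that more homogeneous elements means lower disorder
        (let ((branch (examples-with-value (car values) examples)))
          (loop (cdr values)
                (if (homogeneous? branch)
                    (+ count (length branch))
                    count))))))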

Note: there is another common method for measuring disorder in identification trees. The average disorder formula is as follows:

sum over branches: (num in branch / num total) * disorder of branch

Where disorder of branch =

sum over classes: -(num of class in branch / num in branch) * lg (num of class in branch / num in branch)

You don't have to implement this disorder metric for the problem set.
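
Not required, but for reference, here is a sketch of the average disorder formula above, assuming each branch is represented simply as a list of class labels:

(define (log2 x) (/ (log x) (log 2)))

;; how many elements of lst are equal to item
(define (count-occurrences item lst)
  (cond ((null? lst) 0)
        ((equal? item (car lst)) (+ 1 (count-occurrences item (cdr lst))))
        (else (count-occurrences item (cdr lst)))))

;; disorder of a single branch: -sum over classes of p * lg p
(define (branch-disorder branch classes)
  (apply +
         (map (lambda (class)
                (let ((p (/ (count-occurrences class branch) (length branch))))
                  (if (= p 0) 0 (- (* p (log2 p))))))
              classes)))

;; weighted average of the branch disorders
(define (average-disorder branches classes)
  (let ((total (apply + (map length branches))))
    (apply +
           (map (lambda (branch)
                  (* (/ (length branch) total)
                     (branch-disorder branch classes)))
                branches))))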


Once you have a working disorder formula, you can generate and use a decision tree in the following manner:

(define congress-id-classifier (make-idtree-classifier congress-attribute-specs train-data-large))
(congress-id-classifier feature-vector)

You may also use validate-classifier to test your decision tree as you did with nearest-neighbors:

(validate-classifier congress-id-classifier test-data-large)