Lab 6: KNNs & ID Trees

From 6.034 Wiki

Revision as of 03:10, 5 October 2016 by Jmn (Talk | contribs)

This lab is due by Friday, October 21 at 10:00pm.

(todo: put code online) Your answers for this lab belong in the main file lab6.py. This lab covers k-nearest neighbors and identification trees.

k-Nearest Neighbors

Multiple Choice

todo: something about drawing 1NN boundaries

Python warm-up: Distance metrics

todo explain functions, Point class

def euclidean_distance(point1, point2):
def manhattan_distance(point1, point2):
def hamming_distance(point1, point2):
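One possible sketch of these three metrics (the Point namedtuple below is only a stand-in for the real Point class in api.py; all that matters here is that points carry a .coords vector):

```python
from math import sqrt
from collections import namedtuple

# Stand-in for the real Point class in api.py; only .coords is needed here.
Point = namedtuple("Point", ["coords"])

def euclidean_distance(point1, point2):
    """Straight-line distance: square root of the sum of squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(point1.coords, point2.coords)))

def manhattan_distance(point1, point2):
    """City-block distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(point1.coords, point2.coords))

def hamming_distance(point1, point2):
    """Number of coordinate positions at which the two points differ."""
    return sum(a != b for a, b in zip(point1.coords, point2.coords))
```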

Now, cosine distance. (todo...)

(todo: copy explanation of dot_product and norm from Lab 6)

def dot_product(u, v):
def norm(v):
def cosine_distance(point1, point2):

todo: note that cosine_distance is negated, because cosine is a similarity metric where 1 = most similar and 0 = not similar

todo note about transforming to polar coordinates
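A sketch of one way these could fit together, computing cosine distance as 1 minus the cosine similarity so that identical directions give distance 0 (the Point namedtuple is again a stand-in for the real class):

```python
from math import sqrt
from collections import namedtuple

Point = namedtuple("Point", ["coords"])  # stand-in for api.Point

def dot_product(u, v):
    """Sum of the pairwise products of two vectors' components."""
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    """Length (Euclidean norm) of a vector."""
    return sqrt(dot_product(v, v))

def cosine_distance(point1, point2):
    # Cosine *similarity* is 1 for identical directions, so subtract it
    # from 1 to turn it into a distance (0 = most similar).
    u, v = point1.coords, point2.coords
    return 1 - dot_product(u, v) / (norm(u) * norm(v))
```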

Classifying Points with k-Nearest Neighbors

We've provided a function: (todo explain)

def get_k_closest_points(point, data, k, distance_metric):
    "return list of k points closest to input point, based on metric"
    if k >= len(data):
        return data
    sorted_points = sorted(data, key=lambda p2: distance_metric(point, p2))
    return sorted_points[:k]

Your task is to use that to write a function that classifies a point, given a set of data (as a list of Points), a value of k, and a distance metric (a function):

def knn_classify_point(point, data, k, distance_metric):

todo hint: to get the mode of a list, do: max(set(my_list), key=my_list.count)
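A sketch of one possible implementation, using get_k_closest_points and the mode hint above. The simplified Point namedtuple stands in for the real class, and get_k_closest_points is repeated so the sketch runs standalone:

```python
from collections import namedtuple

Point = namedtuple("Point", ["coords", "classification"])  # stand-in for api.Point

def get_k_closest_points(point, data, k, distance_metric):
    # Provided above; repeated here so the sketch runs standalone.
    if k >= len(data):
        return data
    return sorted(data, key=lambda p2: distance_metric(point, p2))[:k]

def knn_classify_point(point, data, k, distance_metric):
    """Classify `point` as the most common classification among its
    k nearest neighbors in `data`."""
    neighbors = get_k_closest_points(point, data, k, distance_metric)
    classifications = [p.classification for p in neighbors]
    # Mode of a list, per the hint above.
    return max(set(classifications), key=classifications.count)
```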

Cross-validation: Choosing the best k and distance metric

todo explain functions

todo clarify different definitions of cross-validation; we're using leave-one-out

def cross_validate(data, k, distance_metric):


def find_best_k_and_metric(data):
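A possible sketch of both functions, using leave-one-out cross-validation as noted above. The list of candidate metrics and the range of k values tried are assumptions of this sketch, not a specification; earlier helpers are repeated so it runs standalone:

```python
from collections import namedtuple

Point = namedtuple("Point", ["coords", "classification"])  # stand-in for api.Point

def euclidean_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p.coords, q.coords)) ** 0.5

def manhattan_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p.coords, q.coords))

def knn_classify_point(point, data, k, distance_metric):
    # Repeated from the previous section so the sketch runs standalone.
    neighbors = sorted(data, key=lambda p2: distance_metric(point, p2))[:k]
    classifications = [p.classification for p in neighbors]
    return max(set(classifications), key=classifications.count)

def cross_validate(data, k, distance_metric):
    """Leave-one-out cross-validation: classify each point using every *other*
    point as training data; return the fraction classified correctly."""
    correct = sum(knn_classify_point(p, [q for q in data if q is not p],
                                     k, distance_metric) == p.classification
                  for p in data)
    return correct / len(data)

def find_best_k_and_metric(data):
    """Try each candidate (k, metric) pair and return the pair with the
    highest cross-validation score.  (The metric list is an assumption.)"""
    metrics = [euclidean_distance, manhattan_distance]
    candidates = [(k, m) for k in range(1, len(data)) for m in metrics]
    return max(candidates, key=lambda km: cross_validate(data, km[0], km[1]))
```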

More multiple choice

todo something about overfitting, underfitting, running classify and find_best_k... on some data to see what happens

Identification Trees

Using an ID Tree to classify unknown points

todo explain function

def id_tree_classify_point(point, id_tree):

See the API reference below. In particular, the methods .apply_classifier() and .get_node_classification() may be useful.

This function can be written cleanly in 3-4 lines, either iteratively or recursively.

Calculating Disorder

todo explain functions

def branch_disorder(branch):

python hints (todo): log2, list.count, float, set


def average_test_disorder(stump):
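One possible sketch. The draft doesn't pin down the argument types, so this assumes `branch` is a list of classification labels and `stump` is a list of such branches:

```python
from math import log2

def branch_disorder(branch):
    """Entropy of a single branch: -sum over classifications c of
    p_c * log2(p_c), where p_c is the fraction of the branch's points
    with classification c.  (Assumes `branch` is a list of labels.)"""
    total = float(len(branch))
    return -sum((branch.count(c) / total) * log2(branch.count(c) / total)
                for c in set(branch))

def average_test_disorder(stump):
    """Average of the branch disorders, weighted by branch size.
    (Assumes `stump` is a list of branches.)"""
    total = float(sum(len(branch) for branch in stump))
    return sum((len(branch) / total) * branch_disorder(branch)
               for branch in stump)
```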

Constructing an ID Tree

todo explain functions

def find_best_classifier(data, possible_classifiers, classification_key):

todo python hint INF

def add_best_classifier_to_tree(id_tree_node, possible_classifiers):
def finish_subtree(id_tree_node, possible_classifiers):
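A hedged sketch of find_best_classifier: it groups the data into branches by each classifier's output, scores each resulting stump by average disorder, and keeps the best. The tie-breaking and error behavior here are assumptions, the Classifier namedtuple is a stand-in for api.Classifier, and branch_disorder is repeated so the sketch runs standalone:

```python
from math import log2
from collections import namedtuple

# Stand-in for api.Classifier: a name plus a function from point -> value.
Classifier = namedtuple("Classifier", ["name", "classify"])

INF = float("inf")  # python hint: larger than any real disorder value

def branch_disorder(labels):
    # Repeated from the previous section so the sketch runs standalone.
    total = float(len(labels))
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

def find_best_classifier(data, possible_classifiers, classification_key):
    """Return the classifier producing the stump with the lowest average
    disorder, skipping classifiers that put every point in one branch."""
    best, best_disorder = None, INF
    for classifier in possible_classifiers:
        # Group the points into branches by the classifier's output.
        branches = {}
        for point in data:
            branches.setdefault(classifier.classify(point), []).append(point)
        if len(branches) < 2:
            continue  # useless: this classifier doesn't split the data at all
        # Weighted-average disorder of the resulting stump.
        total = float(len(data))
        disorder = sum((len(b) / total)
                       * branch_disorder([p[classification_key] for p in b])
                       for b in branches.values())
        if disorder < best_disorder:
            best, best_disorder = classifier, disorder
    if best is None:
        raise ValueError("no classifier usefully splits the data")
    return best
```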

Finally, to put it all together, we've provided a function that calls your functions to construct a complete ID tree:

def construct_greedy_identification_tree(data, possible_classifiers, classification_key):
    id_tree = IdentificationTreeNode(data, classification_key)
    finish_subtree(id_tree, possible_classifiers)
    return id_tree

Conceptual questions

todo

API Reference Documentation

The file api.py defines the Point, Classifier, and IdentificationTreeNode classes, as well as some helper functions for ID trees, all described below. (todo: describe helper functions)

Point (for kNN)

Point objects are used in the k-nearest neighbors section only.

A Point has the following attributes:

  • point.name, the name of the point (a string), if defined.
  • point.coords, the coordinates of the point, represented as a vector (a tuple or list of numbers).
  • point.classification, the classification of the point, if known.

Classifier

Classifier objects are used for constructing and manipulating ID trees.

A Classifier has the following attributes:

  • classifier.name, the name of the classifier.
  • classifier.classify, a function that takes in a point and returns a value.

In our ID trees, a point is represented as a dict mapping attribute names to their values.

IdentificationTreeNode

In this lab, an ID tree is represented recursively as a tree of IdentificationTreeNode objects. In a completed ID tree, each node has either a classification (such as "Oak" or "Maple"), or a classifier (such as "has leaves") with branches leading to child nodes.

An IdentificationTreeNode has the following attributes and methods:

  • id_tree_node.data, the training points at this node in the tree.
  • id_tree_node.classification_key, the attribute by which the tree classifies points (e.g. "tree_type" for the tree data, or "Classification" for the angel data).
  • id_tree_node.get_parent_branch_name(): returns the name of the branch leading to this node, or "ID Tree" if it is the root node.
  • id_tree_node.is_leaf(): returns True if the node is a leaf (has a classification), otherwise False.
  • id_tree_node.set_classifier(classifier): uses the specified classifier to add branches below the current node. Modifies and returns the original id_tree_node. May print warnings if the specified classifier is inadvisable.
  • id_tree_node.apply_classifier(point): applies the node's classifier to the point by following the appropriate branch of the tree, then returns the resulting child node. If the node is a leaf (and thus has no classifier), raises an error.
  • id_tree_node.get_classifier(): returns the node's classifier, if any.
  • id_tree_node.set_node_classification(classification): sets the node's classification, which defines the node as a leaf node. Modifies and returns the original id_tree_node. May print warnings if the node already has branches defined.
  • id_tree_node.get_node_classification(): returns the node's classification, if any.
  • id_tree_node.get_branches(): returns a dict mapping the node's child branches to the nodes at the end of them.
  • id_tree_node.describe_type(): returns a string describing the node type (leaf, subtree, etc.).