Lab 6: KNNs & ID Trees

From 6.034 Wiki

Revision as of 03:12, 5 October 2016 by Jmn (Talk | contribs)
Jump to: navigation, search

Contents


This lab is due by Friday, October 21 at 10:00pm.

(todo put code online) Your answers for this lab belong in the main file lab5.py. This lab covers k-nearest neighbors and identification trees.

k-Nearest Neighbors

Multiple Choice

todo: something about drawing 1NN boundaries

Python warm-up: Distance metrics

todo explain functions, Point class

def euclidean_distance(point1, point2):
def manhattan_distance(point1, point2):
def hamming_distance(point1, point2):

Now, cosine distance. (todo...)

(todo: copy explanation of dot_product and norm from Lab 6)

def dot_product(u, v):
def norm(v):
def cosine_distance(point1, point2):

todo note about cosine_distance being negated, b/c it's a similarity metric where 1 = most similar, 0 = not similar

todo note about transforming to polar coordinates

Classifying Points with k-Nearest Neighbors

We've provided a function: (todo explain)

def get_k_closest_points(point, data, k, distance_metric):
    "return list of k points closest to input point, based on metric"
    if k >= len(data):
        return data
    sorted_points = sorted(data, key=lambda p2: distance_metric(point, p2))
    return sorted_points[:k]

Your task is to use that to write a function that classifies a point, given a set of data (as a list of Points), a value of k, and a distance metric (a function):

def knn_classify_point(point, data, k, distance_metric):

todo hint: to get the mode of a list, do: max(set(my_list), key=my_list.count)

Cross-validation: Choosing the best k and distance metric

todo explain functions

todo clarify different definitions of cross-validation; we're using leave-one-out

def cross_validate(data, k, distance_metric):


def find_best_k_and_metric(data):

More multiple choice

todo something about overfitting, underfitting, running classify and find_best_k... on some data to see what happens

Identification Trees

Using an ID Tree to classify unknown points

todo explain function

def id_tree_classify_point(point, id_tree):

See API. In particular, the methods .apply_classifier() and .get_node_classification() may be useful.

This function can be written cleanly in 3-4 lines, either iteratively or recursively.

Calculating Disorder

todo explain functions

def branch_disorder(branch):

python hints (todo): log2, list.count, float, set


def average_test_disorder(stump):

Constructing an ID Tree

todo explain functions

def find_best_classifier(data, possible_classifiers, classification_key):

todo python hint INF

def add_best_classifier_to_tree(id_tree_node, possible_classifiers):
def finish_subtree(id_tree_node, possible_classifiers):

Finally, to put it all together, we've provided a function that calls your functions to construct a complete ID tree:

def construct_greedy_identification_tree(data, possible_classifiers, classification_key):
    id_tree = IdentificationTreeNode(data, classification_key)
    finish_subtree(id_tree, possible_classifiers)
    return id_tree

Conceptual questions

todo

Survey

Please answer these questions at the bottom of your lab5.py file:

  • NAME: What is your name? (string)
  • COLLABORATORS: Other than 6.034 staff, with whom did you work on this lab? (string, or empty string if you worked alone)
  • HOW_MANY_HOURS_THIS_LAB_TOOK: Approximately how many hours did you spend on this lab? (number or string)
  • WHAT_I_FOUND_INTERESTING: Which parts of this lab, if any, did you find interesting? (string)
  • WHAT_I_FOUND_BORING: Which parts of this lab, if any, did you find boring or tedious? (string)
  • (optional) SUGGESTIONS: What specific changes would you recommend, if any, to improve this lab for future years? (string)


(We'd ask which parts you find confusing, but if you're confused you should really ask a TA.)

When you're done, run the online tester to submit your code.

API Reference Documentation

The file api.py defines the Point, Classifier, and IdentificationTreeNode classes, as well as some helper functions for ID trees, all described below. (todo: describe helper functions)

Point (for kNN)

Point objects are used in the k-nearest neighbors section only.

A Point has the following attributes:

  • point.name, the name of the point (a string), if defined.
  • point.coords, the coordinates of the point, represented as a vector (a tuple or list of numbers).
  • point.classification, the classification of the point, if known.

Classifier

Classifier objects are used for constructing and manipulating ID trees.

A Classifier has the following attributes:

  • classifier.name, the name of the classifier.
  • classifier.classify, a function that takes in a point and returns a value.

In our ID trees, a point is represented as a dict mapping attribute names to their values.

IdentificationTreeNode

In this lab, an ID tree is represented recursively as a tree of IndentificationTreeNode objects. In a completed ID tree, each node has either a classification (such as "Oak" or "Maple"), or a classifier (such as "has leaves") with branches leading to child nodes.

An IndentificationTreeNode has the following attributes and methods:

  • id_tree_node.data, the training points at this node in the tree.
  • id_tree_node.classification_key, the attribute by which the tree classifies points (e.g. "tree_type" for the tree data, or "Classification" for the angel data).
  • id_tree_node.get_parent_branch_name(): returns the name of the branch leading to this node, or "ID Tree" if it is the root node
  • id_tree_node.is_leaf(): returns True if the node is a leaf (has a classification), otherwise False
  • id_tree_node.set_classifier(classifier): Uses the specified classifier to add branches below the current node. Modifies and returns the original id_tree_node. May print warnings if the specified classifier is inadvisable.
  • id_tree_node.apply_classifier(point): Applies classifier to point by following the appropriate branch of the tree, then returns a child node. If node is leaf node (and thus doesn't have a classifier), raises error.
  • id_tree_node.get_classifier(): returns the classifier, if any
  • id_tree_node.set_node_classification(classification): Sets the node's classification, which defines the node as a leaf node. Modifies and returns original id_tree_node. May print warnings if the node already has branches defined.
  • id_tree_node.get_node_classification(): returns the node's classification, if any
  • id_tree_node.get_branches(): returns a dict mapping node's child branches to the nodes at the end of them
  • id_tree_node.describe_type(): returns a string describing the node type (leaf, subtree, etc)
Personal tools