Lab 6: KNNs & ID Trees
From 6.034 Wiki
Contents |
This lab is due by Friday, October 21 at 10:00pm.
(todo put code online) Your answers for this lab belong in the main file lab5.py. This lab covers k-nearest neighbors and identification trees.
k-Nearest Neighbors
Multiple Choice
todo: something about drawing 1NN boundaries
Python warm-up: Distance metrics
todo explain functions, Point class
def euclidean_distance(point1, point2):
def manhattan_distance(point1, point2):
def hamming_distance(point1, point2):
Now, cosine distance. (todo...)
(todo: copy explanation of dot_product and norm from Lab 6)
def dot_product(u, v):
def norm(v):
def cosine_distance(point1, point2):
todo note about cosine_distance being negated, b/c it's a similarity metric where 1 = most similar, 0 = not similar
todo note about transforming to polar coordinates
Classifying Points with k-Nearest Neighbors
We've provided a function: (todo explain)
def get_k_closest_points(point, data, k, distance_metric): "return list of k points closest to input point, based on metric" if k >= len(data): return data sorted_points = sorted(data, key=lambda p2: distance_metric(point, p2)) return sorted_points[:k]
Your task is to use that to write a function that classifies a point, given a set of data (as a list of Points), a value of k, and a distance metric (a function):
def knn_classify_point(point, data, k, distance_metric):
todo hint: to get the mode of a list, do: max(set(my_list), key=my_list.count)
Cross-validation: Choosing the best k and distance metric
todo add cross_validate fn and find_best_...
More multiple choice
todo something about overfitting, underfitting, running classify and find_best_k... on some data to see what happens
Identification Trees
Using an ID Tree to classify unknown points
todo explain function
def classify_point(point, id_tree):
See API. In particular, the methods .apply_classifier() and .get_node_classification() may be useful.
This function can be written cleanly in 3-4 lines, either iteratively or recursively.
Calculating Disorder
todo explain functions
def branch_disorder(branch):
python hints (todo): log2, list.count, float, set
def average_test_disorder(stump):
Constructing an ID Tree
todo explain functions
def find_best_classifier(data, possible_classifiers, classification_key):
todo python hint INF
def add_best_classifier_to_tree(id_tree_node, possible_classifiers):
def finish_subtree(id_tree_node, possible_classifiers):
Finally, to put it all together, we've provided a function that calls your functions to construct a complete ID tree:
def construct_greedy_identification_tree(data, possible_classifiers, classification_key): id_tree = IdentificationTreeNode(data, classification_key) finish_subtree(id_tree, possible_classifiers) return id_tree
Conceptual questions
todo
API Reference Documentation
The file api.py defines the Point, Classifier, and IdentificationTreeNode classes, as well as some helper functions for ID trees, all described below. (todo: describe helper functions)
Point (for kNN)
Point objects are used in the k-nearest neighbors section only.
A Point has the following attributes:
- point.name, the name of the point (a string), if defined.
- point.coords, the coordinates of the point, represented as a vector (a tuple or list of numbers).
- point.classification, the classification of the point, if known.
Classifier
Classifier objects are used for constructing and manipulating ID trees.
A Classifier has the following attributes:
- classifier.name, the name of the classifier.
- classifier.classify, a function that takes in a point and returns a value.
In our ID trees, a point is represented as a dict mapping attribute names to their values.
IdentificationTreeNode
In this lab, an ID tree is represented recursively as a tree of IndentificationTreeNode objects. In a completed ID tree, each node has either a classification (such as "Oak" or "Maple"), or a classifier (such as "has leaves") with branches leading to child nodes.
An IndentificationTreeNode has the following attributes and methods:
- id_tree_node.data, the training points at this node in the tree.
- id_tree_node.classification_key, the attribute by which the tree classifies points (e.g. "tree_type" for the tree data, or "Classification" for the angel data).
- id_tree_node.get_parent_branch_name(): returns the name of the branch leading to this node, or "ID Tree" if it is the root node
- id_tree_node.is_leaf(): returns True if the node is a leaf (has a classification), otherwise False
- id_tree_node.set_classifier(classifier): Uses the specified classifier to add branches below the current node. Modifies and returns the original id_tree_node. May print warnings if the specified classifier is inadvisable.
- id_tree_node.apply_classifier(point): Applies classifier to point by following the appropriate branch of the tree, then returns a child node. If node is leaf node (and thus doesn't have a classifier), raises error.
- id_tree_node.get_classifier(): returns the classifier, if any
- id_tree_node.set_node_classification(classification): Sets the node's classification, which defines the node as a leaf node. Modifies and returns original id_tree_node. May print warnings if the node already has branches defined.
- id_tree_node.get_node_classification(): returns the node's classification, if any
- id_tree_node.get_branches(): returns a dict mapping node's child branches to the nodes at the end of them
- id_tree_node.describe_type(): returns a string describing the node type (leaf, subtree, etc)