Lab 6: KNNs & ID Trees
From 6.034 Wiki
(→k-Nearest Neighbors: skeleton) |
m |
||
Line 3: | Line 3: | ||
This lab is due by Friday, October 21 at 10:00pm. | This lab is due by Friday, October 21 at 10:00pm. | ||
+ | (todo put code online) | ||
+ | <!-- | ||
To work on this lab, you will need to get the code, much like you did for the previous labs. You can: | To work on this lab, you will need to get the code, much like you did for the previous labs. You can: | ||
* Use Git on your computer: <tt>git clone username@athena.dialup.mit.edu:/mit/6.034/www/labs/lab5</tt> | * Use Git on your computer: <tt>git clone username@athena.dialup.mit.edu:/mit/6.034/www/labs/lab5</tt> | ||
Line 8: | Line 10: | ||
* Download it as a ZIP file: http://web.mit.edu/6.034/www/labs/lab5/lab5.zip | * Download it as a ZIP file: http://web.mit.edu/6.034/www/labs/lab5/lab5.zip | ||
* View the files individually: http://web.mit.edu/6.034/www/labs/lab5/ | * View the files individually: http://web.mit.edu/6.034/www/labs/lab5/ | ||
- | + | --> | |
Your answers for this lab belong in the main file <tt>lab5.py</tt>. This lab covers k-nearest neighbors and identification trees. | Your answers for this lab belong in the main file <tt>lab5.py</tt>. This lab covers k-nearest neighbors and identification trees. | ||
Revision as of 19:58, 4 October 2016
Contents |
This lab is due by Friday, October 21 at 10:00pm.
(todo put code online) Your answers for this lab belong in the main file lab5.py. This lab covers k-nearest neighbors and identification trees.
k-Nearest Neighbors
Multiple Choice
todo: something about drawing 1NN boundaries
Python warm-up: Distance metrics
todo explain functions, Point class
def euclidean_distance(point1, point2):
def manhattan_distance(point1, point2):
def hamming_distance(point1, point2):
Now, cosine distance. (todo...)
(todo: copy explanation of dot_product and norm from Lab 6)
def dot_product(u, v):
def norm(v):
def cosine_distance(point1, point2):
todo note about cosine_distance being negated, b/c it's a similarity metric where 1 = most similar, 0 = not similar
todo note about transforming to polar coordinates
Classifying Points with k-Nearest Neighbors
We've provided a function: (todo explain)
def get_k_closest_points(point, data, k, distance_metric): "return list of k points closest to input point, based on metric" if k >= len(data): return data sorted_points = sorted(data, key=lambda p2: distance_metric(point, p2)) return sorted_points[:k]
Your task is to use that to write a function that classifies a point, given a set of data (as a list of Points), a value of k, and a distance metric (a function):
def classify(point, data, k, distance_metric):
Cross-validation: Choosing the best k and distance metric
todo add cross_validate fn and find_best_...
More multiple choice
todo something about overfitting, underfitting, running classify and find_best_k... on some data to see what happens
Identification Trees
Using an ID Tree to classify unknown points
todo explain function
def classify_point(point, id_tree):
See API. In particular, the methods .apply_classifier() and .get_node_classification() may be useful.
This function can be written cleanly in 3-4 lines, either iteratively or recursively.
Calculating Disorder
todo explain functions
def branch_disorder(branch):
python hints (todo): log2, list.count, float, set
def average_test_disorder(stump):
Constructing an ID Tree
todo explain functions
def find_best_classifier(data, possible_classifiers, classification_key):
todo python hint INF
def add_best_classifier_to_tree(id_tree_node, possible_classifiers):
def finish_subtree(id_tree_node, possible_classifiers):
Finally, to put it all together, we've provided a function that calls your functions to construct a complete ID tree:
def construct_greedy_identification_tree(data, possible_classifiers, classification_key): id_tree = IdentificationTreeNode(data, classification_key) finish_subtree(id_tree, possible_classifiers) return id_tree
Conceptual questions
todo
API Reference Documentation
The file api.py defines the Point, Classifier, and IdentificationTreeNode classes, as well as some helper functions for ID trees, all described below. (todo: describe helper functions)
Point (for kNN)
Point objects are used in the k-nearest neighbors section only.
A Point has the following attributes:
- point.name, the name of the point (a string), if defined.
- point.coords, the coordinates of the point, represented as a vector (a tuple or list of numbers).
- point.classification, the classification of the point, if known.
Classifier
Classifier objects are used for constructing and manipulating ID trees.
A Classifier has the following attributes:
- classifier.name, the name of the classifier.
- classifier.classify, a function that takes in a point and returns a value.
In our ID trees, a point is represented as a dict mapping attribute names to their values.
IdentificationTreeNode
In this lab, an ID tree is represented recursively as a tree of IndentificationTreeNode objects. In a completed ID tree, each node has either a classification (such as "Oak" or "Maple"), or a classifier (such as "has leaves") with branches leading to child nodes.
An IndentificationTreeNode has the following attributes and methods:
- id_tree_node.data, the training points at this node in the tree.
- id_tree_node.classification_key, the attribute by which the tree classifies points (e.g. "tree_type" for the tree data, or "Classification" for the angel data).
- id_tree_node.get_parent_branch_name(): returns the name of the branch leading to this node, or "ID Tree" if it is the root node
- id_tree_node.is_leaf(): returns True if the node is a leaf (has a classification), otherwise False
- id_tree_node.set_classifier(classifier): Uses the specified classifier to add branches below the current node. Modifies and returns the original id_tree_node. May print warnings if the specified classifier is inadvisable.
- id_tree_node.apply_classifier(point): Applies classifier to point by following the appropriate branch of the tree, then returns a child node. If node is leaf node (and thus doesn't have a classifier), raises error.
- id_tree_node.get_classifier(): returns the classifier, if any
- id_tree_node.set_node_classification(classification): Sets the node's classification, which defines the node as a leaf node. Modifies and returns original id_tree_node. May print warnings if the node already has branches defined.
- id_tree_node.get_node_classification(): returns the node's classification, if any
- id_tree_node.get_branches(): returns a dict mapping node's child branches to the nodes at the end of them
- id_tree_node.describe_type(): returns a string describing the node type (leaf, subtree, etc)