Lab 6: KNNs & ID Trees

__TOC__
{{Lab_Due|when=Thursday, October 29}}

{{Get_Code|lab=6}}
Your answers for this lab belong in the main file <tt>lab6.py</tt>.  This lab covers k-nearest neighbors and identification trees.
== Breaking down a problem into smaller sub-problems ==

For your convenience, we have split this lab into two parts.  Part 1 covers the topic of [[Lab_6:_Identification_Trees | identification trees]], and Part 2 covers [[Lab_6:_k-Nearest_Neighbors | k-nearest neighbors]].

We strongly suggest that you complete [[Lab_6:_Identification_Trees | Part 1 (ID Trees)]] by '''Thursday, October 15'''.

We strongly suggest that you then complete [[Lab_6:_k-Nearest_Neighbors | Part 2 (kNN)]] by '''Tuesday, October 20'''.

The final deadline to complete Lab 6 (both [[Lab_6:_Identification_Trees | Part 1 (ID Trees)]] and [[Lab_6:_k-Nearest_Neighbors | Part 2 (kNN)]]) is '''Thursday, October 29'''.

Note that if you submit online after completing Part 1 (but before completing Part 2), the tester will fail the kNN tests, as expected.  Don't fret!  '''As long as you submit and pass all online tests by Thursday, October 29, you will be awarded full lab credit.'''

== k-Nearest Neighbors ==

=== Multiple Choice ===

''(Multiple-choice questions about drawing 1-nearest-neighbor decision boundaries will appear here.)''

=== Python warm-up: Distance metrics ===

k-nearest neighbors can use many different distance metrics.  In 6.034, we cover four of them.  For two points <math>u</math> and <math>v</math> with coordinates <math>u_1, \ldots, u_n</math> and <math>v_1, \ldots, v_n</math>:

* Euclidean distance: the straight-line distance between two points, <math>\sqrt{\sum_i (u_i - v_i)^2}</math>.
* Manhattan distance: the distance between two points when travel is restricted to directions parallel to the coordinate axes, as if walking along a rectangular street grid: <math>\sum_i |u_i - v_i|</math>.
* Hamming distance: the number of coordinates at which the two points differ, <math>\sum_i \mathbf{1}[u_i \neq v_i]</math>.
* Cosine distance: a measure of the angle between the two vectors, <math>1 - \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}</math>.  Note that the underlying quantity, cosine ''similarity'', is a similarity metric (1 = most similar, 0 = not similar), so <tt>cosine_distance</tt> turns it into a distance by subtracting it from 1.

It's also possible to transform the data instead of using a different metric -- for instance, transforming to polar coordinates -- but in this lab we will only work with distance metrics in Cartesian coordinates.

We'll start with some basic functions for manipulating vectors, which may be useful for <tt>cosine_distance</tt>.  For these, we will represent an ''n''-dimensional vector as a list or tuple of ''n'' coordinates.

<tt>dot_product</tt> should compute the dot product of two vectors, while <tt>norm</tt> computes the length of a vector.  (There is a simple implementation of <tt>norm</tt> that uses <tt>dot_product</tt>.)  Implement both functions:

 def dot_product(u, v):

 def norm(v):
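If you'd like a reference point, here is a minimal sketch of one possible implementation, assuming vectors are plain lists or tuples of numbers:

<pre>
def dot_product(u, v):
    # Sum of pairwise products of coordinates.
    return sum(ui * vi for ui, vi in zip(u, v))

def norm(v):
    # The length of a vector is the square root of its dot product with itself.
    return dot_product(v, v) ** 0.5
</pre>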
Next, you'll implement each of the four distance metrics.  For the rest of the k-nearest neighbors portion of this lab, we will represent data points as [[#Point_(for_kNN)|Point]] objects (described below in the API).  Each distance function should take in two Points and return the distance between them as a float or int.  Implement each distance function:

 def euclidean_distance(point1, point2):

 def manhattan_distance(point1, point2):

 def hamming_distance(point1, point2):

 def cosine_distance(point1, point2):
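As a sketch, each metric is then only a line or two on top of the vector helpers above (assuming, per the API below, that <tt>point.coords</tt> holds the coordinate tuple):

<pre>
def euclidean_distance(point1, point2):
    # Straight-line distance: sqrt of the sum of squared coordinate differences.
    return sum((a - b) ** 2 for a, b in zip(point1.coords, point2.coords)) ** 0.5

def manhattan_distance(point1, point2):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(point1.coords, point2.coords))

def hamming_distance(point1, point2):
    # Number of coordinates at which the two points differ.
    return sum(a != b for a, b in zip(point1.coords, point2.coords))

def cosine_distance(point1, point2):
    # 1 minus the cosine similarity of the two coordinate vectors.
    u, v = point1.coords, point2.coords
    return 1 - dot_product(u, v) / (norm(u) * norm(v))
</pre>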
=== Classifying Points with k-Nearest Neighbors ===
We've provided a helper function, <tt>get_k_closest_points</tt>, which returns the k points in <tt>data</tt> that are closest to the input point according to the given distance metric (or all of the data, if there are no more than k points):

<pre>
def get_k_closest_points(point, data, k, distance_metric):
    "return list of k points closest to input point, based on metric"
    if k >= len(data):
        return data
    sorted_points = sorted(data, key=lambda p2: distance_metric(point, p2))
    return sorted_points[:k]
</pre>
Your task is to use <tt>get_k_closest_points</tt> to write a function that classifies a point, given a set of data (as a list of Points), a value of k, and a distance metric (a function):

 def knn_classify_point(point, data, k, distance_metric):
Python hint: to get the mode (most common element) of a list, you can use <tt>max(set(my_list), key=my_list.count)</tt>.
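Putting the pieces together, one possible sketch (using, per the API, each training Point's <tt>.classification</tt> attribute):

<pre>
def knn_classify_point(point, data, k, distance_metric):
    # Find the k nearest training points, then return their most
    # common classification.
    neighbors = get_k_closest_points(point, data, k, distance_metric)
    classifications = [p.classification for p in neighbors]
    return max(set(classifications), key=classifications.count)
</pre>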
=== Cross-validation: Choosing the best k and distance metric ===

"Cross-validation" can refer to several related procedures; in this lab, we use leave-one-out cross-validation: for each point in the data set, train on every ''other'' point, then check whether the resulting classifier predicts the held-out point's classification correctly.

First, write a function that performs leave-one-out cross-validation and returns the fraction of points classified correctly, using the given value of k and distance metric:

 def cross_validate(data, k, distance_metric):

Then, write a function that uses cross-validation to find and return the combination of k and distance metric (among the four metrics above) that classifies the data best:

 def find_best_k_and_metric(data):
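Here is a minimal sketch of both functions.  It assumes <tt>cross_validate</tt> reports the fraction of points classified correctly and that the best <tt>(k, metric)</tt> pair is returned as a tuple; the range of k values tried is an illustrative choice, not part of the spec:

<pre>
def cross_validate(data, k, distance_metric):
    # Leave-one-out: hold out each point in turn and classify it
    # using all of the remaining points as training data.
    correct = 0
    for i, point in enumerate(data):
        training = data[:i] + data[i+1:]
        if knn_classify_point(point, training, k, distance_metric) == point.classification:
            correct += 1
    return correct / len(data)

def find_best_k_and_metric(data):
    # Try each (k, metric) combination and keep the best-scoring one.
    metrics = [euclidean_distance, manhattan_distance,
               hamming_distance, cosine_distance]
    best, best_score = None, -1
    for k in range(1, len(data)):  # illustrative range of k values
        for metric in metrics:
            score = cross_validate(data, k, metric)
            if score > best_score:
                best, best_score = (k, metric), score
    return best
</pre>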
=== More multiple choice ===

''(Questions about overfitting and underfitting, based on running <tt>knn_classify_point</tt> and <tt>find_best_k_and_metric</tt> on sample data to see what happens, will appear here.)''
== Identification Trees ==

=== Using an ID Tree to classify unknown points ===

First, write a function that uses a completed ID tree to classify a point: descend the tree from its root, following the appropriate branch at each node, until you reach a leaf, then return that leaf's classification:

 def id_tree_classify_point(point, id_tree):

See the API below.  In particular, the methods <tt>.apply_classifier()</tt> and <tt>.get_node_classification()</tt> may be useful.

This function can be written cleanly in 3-4 lines, either iteratively or recursively.
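For instance, a recursive sketch using those two methods plus <tt>.is_leaf()</tt>:

<pre>
def id_tree_classify_point(point, id_tree):
    # If this node is a leaf, report its classification;
    # otherwise follow the appropriate branch and recurse.
    if id_tree.is_leaf():
        return id_tree.get_node_classification()
    return id_tree_classify_point(point, id_tree.apply_classifier(point))
</pre>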
=== Calculating Disorder ===

To decide which classifier is best, we measure the disorder it produces.  A branch whose points all share a single classification is homogeneous and has disorder 0; a branch whose points are evenly split among classifications has maximal disorder.  Using the entropy-based definition from lecture, a branch containing <math>n</math> points, <math>n_c</math> of which have classification <math>c</math>, has disorder

: <math>D_{\text{branch}} = -\sum_c \frac{n_c}{n} \log_2\!\left(\frac{n_c}{n}\right)</math>

First, write a function that computes the disorder of a single branch.  Here, a branch is represented simply as the list of classifications of the points in that branch:

 def branch_disorder(branch):

Python hints: <tt>math.log2</tt>, <tt>list.count</tt>, <tt>float</tt>, and <tt>set</tt> may all be useful.

Next, compute the average disorder of a test (a classifier applied at a node, or "stump"): the average of its branches' disorders, where each branch is weighted by the fraction of the points that fall into it.  Here, a stump is represented as a list of branches:

: <math>D_{\text{test}} = \sum_{\text{branches}} \frac{n_{\text{branch}}}{n_{\text{total}}} \cdot D_{\text{branch}}</math>

 def average_test_disorder(stump):
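A compact sketch of both functions, under the representation described above (a branch is a list of classifications; a stump is a list of branches):

<pre>
from math import log2

def branch_disorder(branch):
    # Entropy of the branch's classification counts.
    total = float(len(branch))
    return -sum((branch.count(c) / total) * log2(branch.count(c) / total)
                for c in set(branch))

def average_test_disorder(stump):
    # Weighted average of branch disorders, weighted by branch size.
    total = float(sum(len(branch) for branch in stump))
    return sum((len(branch) / total) * branch_disorder(branch)
               for branch in stump)
</pre>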
=== Constructing an ID Tree ===

Now you'll construct ID trees greedily.  First, write a function that returns the classifier from <tt>possible_classifiers</tt> that produces the lowest average test disorder when used to split <tt>data</tt>:

 def find_best_classifier(data, possible_classifiers, classification_key):

Python hint: an infinite initial value -- <tt>float('inf')</tt>, which the lab code calls <tt>INF</tt> -- may be useful when minimizing disorder.

Next, write a function that finds the best classifier for a node's data and uses it to add branches below that node:

 def add_best_classifier_to_tree(id_tree_node, possible_classifiers):

Then, write a recursive function that completes the subtree rooted at a node: if the node's data is homogeneous, mark the node as a leaf with that classification; otherwise, add the best classifier and recur on each child node:

 def finish_subtree(id_tree_node, possible_classifiers):
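To illustrate the recursive structure (a sketch only -- for instance, handling of data that no classifier can usefully split is omitted), <tt>finish_subtree</tt> might look like:

<pre>
def finish_subtree(id_tree_node, possible_classifiers):
    # Collect the classifications of the points at this node.
    key = id_tree_node.classification_key
    classifications = [point[key] for point in id_tree_node.data]
    if len(set(classifications)) == 1:
        # Homogeneous data: this node becomes a leaf.
        id_tree_node.set_node_classification(classifications[0])
    else:
        # Otherwise split on the best classifier and recur on each child.
        add_best_classifier_to_tree(id_tree_node, possible_classifiers)
        for child in id_tree_node.get_branches().values():
            finish_subtree(child, possible_classifiers)
    return id_tree_node
</pre>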
Finally, to put it all together, we've provided a function that calls your functions to construct a complete ID tree:

<pre>
def construct_greedy_identification_tree(data, possible_classifiers, classification_key):
    id_tree = IdentificationTreeNode(data, classification_key)
    finish_subtree(id_tree, possible_classifiers)
    return id_tree
</pre>
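For instance, you could build and query a tree on a toy data set like the one below.  Everything here -- the points, the attribute names, and the assumption that the <tt>Classifier</tt> constructor takes a name and a classify function -- is illustrative, not part of the lab's data:

<pre>
# Toy data set: decide whether to play outside, based on the weather.
# All names and values below are illustrative assumptions.
data = [
    {"weather": "sunny", "windy": False, "play": "yes"},
    {"weather": "sunny", "windy": True,  "play": "yes"},
    {"weather": "rainy", "windy": False, "play": "no"},
    {"weather": "rainy", "windy": True,  "play": "no"},
]
possible_classifiers = [
    Classifier("weather", lambda p: p["weather"]),
    Classifier("windy", lambda p: p["windy"]),
]
tree = construct_greedy_identification_tree(data, possible_classifiers, "play")
print(id_tree_classify_point({"weather": "sunny", "windy": False}, tree))  # -> "yes"
</pre>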
=== Conceptual questions ===

''(Conceptual questions about ID trees will appear here.)''
== Survey ==

Please answer these questions at the bottom of your <tt>lab6.py</tt> file:

* <tt>NAME</tt>: What is your name? (string)
* <tt>COLLABORATORS</tt>: Other than 6.034 staff, with whom did you work on this lab? (string, or empty string if you worked alone)
* <tt>HOW_MANY_HOURS_THIS_LAB_TOOK</tt>: Approximately how many hours did you spend on this lab? (number or string)
* <tt>WHAT_I_FOUND_INTERESTING</tt>: Which parts of this lab, if any, did you find interesting? (string)
* <tt>WHAT_I_FOUND_BORING</tt>: Which parts of this lab, if any, did you find boring or tedious? (string)
* (optional) <tt>SUGGESTIONS</tt>: What specific changes would you recommend, if any, to improve this lab for future years? (string)

(We'd ask which parts you find confusing, but if you're confused you should really ask a TA.)

When you're done, run the online tester to submit your code.
== API Reference Documentation ==

The file <tt>api.py</tt> defines the <tt>Point</tt>, <tt>Classifier</tt>, and <tt>IdentificationTreeNode</tt> classes, as well as some helper functions for ID trees.  The three classes are described below.
=== Point (for kNN) ===

<tt>Point</tt> objects are used in the k-nearest neighbors section only.

A <tt>Point</tt> has the following attributes:
* <b><tt>point</tt></b><tt>.name</tt>, the name of the point (a string), if defined.
* <b><tt>point</tt></b><tt>.coords</tt>, the coordinates of the point, represented as a vector (a tuple or list of numbers).
* <b><tt>point</tt></b><tt>.classification</tt>, the classification of the point, if known.
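For example, assuming <tt>p1</tt> and <tt>p2</tt> are Points drawn from one of the lab's kNN data sets:

<pre>
# Hypothetical usage of the Point attributes described above:
print(p1.name, p1.coords)                  # e.g.  A (1, 2)
if p1.classification == p2.classification:
    print("same class; distance:", euclidean_distance(p1, p2))
</pre>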
=== Classifier ===

<tt>Classifier</tt> objects are used for constructing and manipulating ID trees.

A <tt>Classifier</tt> has the following attributes:
* <b><tt>classifier</tt></b><tt>.name</tt>, the name of the classifier.
* <b><tt>classifier</tt></b><tt>.classify</tt>, a function that takes in a point and returns a value.

In our ID trees, a point is represented as a dict mapping attribute names to their values.
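For example, assuming the <tt>Classifier</tt> constructor takes the name and the classify function in that order (the attribute name <tt>"has_leaves"</tt> is illustrative):

<pre>
# Hypothetical classifier: splits points by their "has_leaves" attribute.
has_leaves_classifier = Classifier("has leaves", lambda point: point["has_leaves"])
</pre>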
=== IdentificationTreeNode ===

In this lab, an ID tree is represented recursively as a tree of <tt>IdentificationTreeNode</tt> objects.  In a completed ID tree, each node has either a classification (such as "Oak" or "Maple"), or a classifier (such as "has leaves") with branches leading to child nodes.

An <tt>IdentificationTreeNode</tt> has the following attributes and methods:
* <b><tt>id_tree_node</tt></b><tt>.data</tt>, the training points at this node in the tree.
* <b><tt>id_tree_node</tt></b><tt>.classification_key</tt>, the attribute by which the tree classifies points (e.g. "tree_type" for the tree data, or "Classification" for the angel data).
* <b><tt>id_tree_node</tt></b><tt>.get_parent_branch_name()</tt>: returns the name of the branch leading to this node, or "ID Tree" if it is the root node.
* <b><tt>id_tree_node</tt></b><tt>.is_leaf()</tt>: returns True if the node is a leaf (has a classification), otherwise False.
* <b><tt>id_tree_node</tt></b><tt>.set_classifier(classifier)</tt>: uses the specified classifier to add branches below the current node.  Modifies and returns the original id_tree_node.  May print warnings if the specified classifier is inadvisable.
* <b><tt>id_tree_node</tt></b><tt>.apply_classifier(point)</tt>: applies the classifier to the point by following the appropriate branch of the tree, then returns the resulting child node.  If the node is a leaf node (and thus doesn't have a classifier), raises an error.
* <b><tt>id_tree_node</tt></b><tt>.get_classifier()</tt>: returns the node's classifier, if any.
* <b><tt>id_tree_node</tt></b><tt>.set_node_classification(classification)</tt>: sets the node's classification, which defines the node as a leaf node.  Modifies and returns the original id_tree_node.  May print warnings if the node already has branches defined.
* <b><tt>id_tree_node</tt></b><tt>.get_node_classification()</tt>: returns the node's classification, if any.
* <b><tt>id_tree_node</tt></b><tt>.get_branches()</tt>: returns a dict mapping the names of the node's branches to the child nodes at the end of them.
* <b><tt>id_tree_node</tt></b><tt>.describe_type()</tt>: returns a string describing the node type (leaf, subtree, etc).
