Lab 6: KNNs & ID Trees

__TOC__

This lab is due by Thursday, October 29 at 10:00pm.

Before working on the lab, you will need to get the code. You can:

* Use Git on your computer: <tt>git clone username@athena.dialup.mit.edu:/mit/6.034/www/labs/fall2020/lab6</tt>
* Use Git on Athena: <tt>git clone /mit/6.034/www/labs/fall2020/lab6</tt>

Your answers for this lab belong in the main file <tt>lab6.py</tt>. To submit your lab to the test server, you will need to download your <tt>key.py</tt> file and put it in either your <tt>lab6</tt> directory or its parent directory. This lab will cover identification trees as well as k-nearest neighbors.

== Breaking down a problem into smaller sub-problems ==

For your convenience, we have split this lab into two sections. Part 1 covers the topic of [[Lab_6:_Identification_Trees | identification trees]], and Part 2 covers [[Lab_6:_k-Nearest_Neighbors | k-nearest neighbors]].

We strongly suggest that you complete [[Lab_6:_Identification_Trees | Part 1 (ID Trees)]] by '''Thursday, October 15'''.

We strongly suggest that you then complete [[Lab_6:_k-Nearest_Neighbors | Part 2 (kNN)]] by '''Tuesday, October 20'''.

The final deadline to complete Lab 6 (both [[Lab_6:_Identification_Trees | Part 1 (ID Trees)]] and [[Lab_6:_k-Nearest_Neighbors | Part 2 (kNN)]]) is '''Thursday, October 29'''.

Note that if you submit online after completing Part 1 (but before completing Part 2), the tester will fail as expected for the kNN tests. Don't fret! '''As long as you submit and pass all online tests by Thursday, October 29, you will be awarded full lab credit.'''

== Identification Trees ==

In this section of the lab, we will first warm up by using an ID tree to classify points. Then, we will learn how to construct new ID trees from raw data, using the techniques we learned in class.

Note: We strongly recommend reading the [[#API_Reference_Documentation | API section]] before writing any code. Otherwise, you may find yourself re-writing code that needn't be written again.

=== Using an ID Tree to classify unknown points ===

In this lab, we will represent an ID tree recursively as a tree of <tt>IdentificationTreeNode</tt> objects.  For more information, see the [[#IdentificationTreeNode | API]] below.

For the ID trees section, we will represent a data point as a dictionary mapping attributes to values. For example, the following could represent a data point in our vampires example from lecture:

  datapoint1 = {"name":"person1", "Vampire?":"Vampire", "Shadow":"?", "Eats Garlic":"No", "Complexion":"Ruddy", "Accent":"Odd"}

To begin, let's use an ID tree to classify a point! Classifying a point with an ID tree is straightforward: recursively apply the current node's classifier to the data point, using the result to choose which branch to take to the child node. When a leaf node is reached, that node's classification is the final classification of the point.

Write a function <tt>id_tree_classify_point</tt> that takes in a point (represented as a dictionary) and an ID tree (represented as an <tt>IdentificationTreeNode</tt>) and uses the ID tree to classify the point (even if the point already has a classification defined), returning the final classification:

 def id_tree_classify_point(point, id_tree):

This function can be written cleanly in 3-4 lines, either iteratively or recursively.

If you're not sure how, check out the available methods in the [[#IdentificationTreeNode | <tt>IdentificationTreeNode</tt> API]].  In particular, the methods <tt>.apply_classifier(point)</tt> and <tt>.get_node_classification()</tt> may be useful.
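
For reference, here is one possible recursive sketch (not the only valid approach), assuming the <tt>IdentificationTreeNode</tt> methods behave as described in the API below:

 def id_tree_classify_point(point, id_tree):
     # If this node is a leaf, it directly provides the classification.
     if id_tree.is_leaf():
         return id_tree.get_node_classification()
     # Otherwise, follow the appropriate branch and recurse on the child node.
     return id_tree_classify_point(point, id_tree.apply_classifier(point))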

=== Splitting Data with a Classifier ===

In order to actually construct an ID tree, we'll need to work directly with classifiers, also known as tests. Recall that at each node of an ID tree, unless the training data at that node is homogeneous, we pick the best available classifier to split the data up. This same logic is repeated at every level of the tree, with each node taking as training data the data points sorted into that category by the node above it.

We will represent each classifier as a <tt>Classifier</tt> object, described in the [[#Classifier | API]] below.

Given a particular <tt>Classifier</tt> and a list of points, write a function <tt>split_on_classifier</tt> that returns a dictionary mapping each possible classifier result (feature value) to the list of points in that result's branch:

def split_on_classifier(data, classifier):
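
As a rough sketch (one reasonable approach, assuming <tt>classifier.classify(point)</tt> returns the feature value for a point, as described in the Classifier API):

 def split_on_classifier(data, classifier):
     # Map each classifier result (branch name) to the list of points in that branch.
     branches = {}
     for point in data:
         branches.setdefault(classifier.classify(point), []).append(point)
     return branches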

=== Calculating Disorder ===

Now that we're able to use an ID tree to classify points, it's time to construct an ID tree yourself!  We'll start by coding the information-disorder equations to calculate the disorder of a branch or test. First, a quick description of the nomenclature we'll be using:

* A '''branch''' is a Python list of ''classifications'' representing the points that, upon applying a particular classifier (e.g. "Eats garlic") to a set of many points, all have the same feature result. For example, the branch <tt>["Vampire", "Vampire", "Vampire", "Not Vampire", "Not Vampire"]</tt> could refer to the five data points that answered <tt>"No"</tt> when asked if they ate garlic: the first three all happen to be vampires, but the last two are not.
* A '''stump''' (or '''decision stump''') is the list of ''all branches'' for a particular classifier. For example, the stump <tt>[["Vampire", "Vampire", "Vampire", "Not Vampire", "Not Vampire"], ["Not Vampire", "Not Vampire", "Not Vampire"]]</tt> could refer to the results from applying the "Eats garlic" classifier to all of the data points; note that the first element of this stump is the branch for <tt>"No"</tt> and the second element of this stump is the branch for <tt>"Yes"</tt>.
* A '''test''' (occasionally we overload the term '''classifier''' here as well) can be thought of as the conceptual manifestation of a decision stump. They basically refer to the same thing.

Note that in the code, the disorder functions take in lists of ''points'' and <tt>Classifier</tt> objects rather than raw lists of classifications: the classifications in a branch are obtained by applying the target classifier to each point in that branch.

Now, recall that the information disorder of a branch represents how varied the values in that branch are. A branch whose constituent classifications are all the same (e.g. <tt>[1, 1, 1]</tt>) is ''homogeneous'', and has a disorder of 0. On the other hand, a branch with many different values has a very high disorder. For example, the branch <tt>[1, 1, 2, 2]</tt> has a disorder of 1, and the branch <tt>[1, 2, 3]</tt> has a disorder of roughly 1.58.

The formula we use for disorder originates in information theory, and is described in detail on page 429 of [http://courses.csail.mit.edu/6.034f/ai3/ch21.pdf the reading]: for a branch containing n<sub>b</sub> points, of which n<sub>c</sub> have classification c, the disorder is the sum over all classifications c of -(n<sub>c</sub>/n<sub>b</sub>) log<sub>2</sub>(n<sub>c</sub>/n<sub>b</sub>).

With this in mind, you will now code <tt>branch_disorder</tt>. <tt>branch_disorder</tt> should take in the list of points in a branch and the target <tt>Classifier</tt> (the classifier you ultimately want the tree to classify by, e.g. "Vampire?"), and return the disorder of the branch, as a number:

 def branch_disorder(data, target_classifier):

For example, if <tt>data</tt> contains three points whose target classifications are <tt>"Oak"</tt>, <tt>"Oak"</tt>, and <tt>"Maple"</tt>, then <tt>branch_disorder(data, target_classifier)</tt> should yield approximately <tt>0.918</tt>.
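
A minimal sketch of one way to implement this, assuming <tt>target_classifier.classify(point)</tt> returns each point's classification (per the Classifier API) and using the provided <tt>log2</tt> helper:

 def branch_disorder(data, target_classifier):
     # Collect the target classification of each point in this branch.
     classifications = [target_classifier.classify(point) for point in data]
     total = float(len(classifications))
     disorder = 0
     # Sum -(n_c/n_b) * log2(n_c/n_b) over each distinct classification c.
     for c in set(classifications):
         fraction = classifications.count(c) / total
         disorder += -fraction * log2(fraction)
     return disorder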

The following Python tricks and shortcuts may be useful here and/or later in the lab:

* <tt>log2</tt>: We've defined a function <tt>log2</tt> that takes in a single number and returns log<sub>2</sub> of the number.
* <tt>INF</tt>: As in previous labs, we've defined the constant <tt>INF</tt>, which is a float representing infinity.
* <tt>float</tt>: Recall that if you divide two ints in Python 2, the result is rounded down to the nearest int, so you'll need to cast one of them to a float if you want to perform normal division.
* <tt>list.count</tt>: <tt>.count</tt> is a built-in list method that counts how many times an item occurs in a list.  For example, <tt>[10, 20, 20, 30].count(20)</tt> -> <tt>2</tt>.
* <tt>set</tt>: Recall from Lab 0 that a set is a handy way to count or enumerate the unique items in a list.

Next, you will use your <tt>branch_disorder</tt> function to help compute the disorder of an entire test (a decision stump). Recall that the disorder of a test is the weighted average of the disorders of the constituent branches, where the weights are determined by how many points are in each branch.

<tt>average_test_disorder</tt> should take in a list of points, the test <tt>Classifier</tt> used to split the data into branches, and the target <tt>Classifier</tt>, and return the disorder of the entire stump, as a number:

 def average_test_disorder(data, test_classifier, target_classifier):

For example, if the test classifier splits the data into two branches whose target classifications are <tt>["Oak", "Oak", "Maple"]</tt> and <tt>["Maple", "Maple"]</tt>, then <tt>average_test_disorder</tt> should yield approximately <tt>0.551</tt>, since the first branch has a weight of <tt>3/5</tt> and a disorder of about <tt>0.918</tt>, and the second branch has a weight of <tt>2/5</tt> and a disorder of <tt>0</tt>.
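
A possible sketch, reusing <tt>split_on_classifier</tt> and <tt>branch_disorder</tt> from above (one reasonable decomposition, not the only one):

 def average_test_disorder(data, test_classifier, target_classifier):
     # Split the data into branches according to the test classifier.
     branches = split_on_classifier(data, test_classifier)
     total = float(len(data))
     # Weight each branch's disorder by the fraction of points in that branch.
     return sum(len(branch) / total * branch_disorder(branch, target_classifier)
                for branch in branches.values())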

=== Constructing an ID Tree ===

Using the disorder functions you defined above, implement a function to select the best classifier.  The function should take in some data (as a list of point dictionaries), a list of classifiers (<tt>Classifier</tt> objects), and the target classifier that you ultimately want your ID tree to be classifying by (e.g. a Classifier representing the test "Vampire?").  The function should return the classifier that has the lowest disorder.

Edge cases:

* If multiple classifiers are tied for lowest disorder, break ties by preferring the classifier that occurs earlier in the list.
* If the classifier with lowest disorder wouldn't separate the data at all (that is, the classifier has only one branch), [[#How_to_handle_exceptions|raise the exception]] <tt>NoGoodClassifiersError</tt> instead of returning a classifier.

def find_best_classifier(data, possible_classifiers, target_classifier):
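
One possible sketch (illustrative only; note that <tt>min</tt> returns the earliest of any tied classifiers, which matches the tie-breaking rule above):

 def find_best_classifier(data, possible_classifiers, target_classifier):
     # Pick the candidate classifier with the lowest average test disorder.
     best = min(possible_classifiers,
                key=lambda c: average_test_disorder(data, c, target_classifier))
     # A classifier that produces only one branch doesn't separate the data at all.
     if len(split_on_classifier(data, best)) == 1:
         raise NoGoodClassifiersError("no classifier separates the data")
     return best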

Now, it is time to build an ID tree!

The function <tt>construct_greedy_id_tree</tt> should take in four arguments:

*<tt>data</tt>: a list of points
*<tt>possible_classifiers</tt>: a list of Classifier objects that you can use in constructing your tree
*<tt>classification_key</tt>: the attribute to classify by, as a string (e.g. "Vampire?")
*<tt>id_tree_node</tt> (optional): an incomplete tree, represented as an <tt>IdentificationTreeNode</tt>. If the incomplete tree is provided, finish it; if it is not provided, create a new tree using the constructor: <tt>id_tree_node = IdentificationTreeNode(classification_key)</tt>

Then, add classifiers and classifications to the tree until either perfect classification has been achieved, or there are no good classifiers left.  We recommend implementing this function recursively (which can be done cleanly in 16 lines).

More specifically:

:1. Once the current id_tree_node is defined, perform one of three actions, depending on the input node's data and available classifiers:
:* If the node is homogeneous (you can use the function <tt>is_homogeneous(data, classification_key)</tt> to check), then it should be a leaf node, so add the classification to the node.
:* If the node is not homogeneous and the data can be divided further, add the best classifier to the node.
:* If the node is not homogeneous but there are no good classifiers left (i.e. no classifiers with more than one branch), leave the node's classification unassigned (which defaults to <tt>None</tt>).
:2. If you added a classifier to the node, use recursion to complete each subtree.
:3. Return the original input node.

def construct_greedy_id_tree(data, possible_classifiers, classification_key, id_tree_node=None):
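
The following rough sketch follows the steps above. Be aware of one assumption: <tt>find_best_classifier</tt> expects the target as a <tt>Classifier</tt> object, while this function receives <tt>classification_key</tt> as a string, so the sketch uses a hypothetical helper <tt>feature_test(classification_key)</tt> to wrap the attribute lookup in a <tt>Classifier</tt>. Adapt that line (and how each branch's data is gathered) to the actual lab API.

 def construct_greedy_id_tree(data, possible_classifiers, classification_key, id_tree_node=None):
     # Create a fresh root node if we weren't given a partial tree to finish.
     if id_tree_node is None:
         id_tree_node = IdentificationTreeNode(classification_key)
     if is_homogeneous(data, classification_key):
         # All points agree on the target attribute: make this node a leaf.
         id_tree_node.set_node_classification(data[0][classification_key])
         return id_tree_node
     try:
         # Assumed helper: build a Classifier for the target attribute.
         target_classifier = feature_test(classification_key)
         best = find_best_classifier(data, possible_classifiers, target_classifier)
     except NoGoodClassifiersError:
         # No classifier separates the data; leave the classification as None.
         return id_tree_node
     id_tree_node.set_classifier(best)
     # Recursively finish each subtree using the points in that branch.
     for branch_name, child_node in id_tree_node.get_branches().items():
         branch_data = [p for p in data if best.classify(p) == branch_name]
         construct_greedy_id_tree(branch_data, possible_classifiers, classification_key, child_node)
     return id_tree_node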

Congrats, you're done constructing ID trees!

We've provided some datasets from past quizzes, so you can now use your ID tree builder to solve problems!  For example, if you run

 print construct_greedy_id_tree(tree_data, tree_classifiers, "tree_type")

...it should compute and print the solution to the tree-identification problem from 2014 Q2.

You can also try:

 print construct_greedy_id_tree(angel_data, angel_classifiers, "Classification")  #from 2012 Q2
 print construct_greedy_id_tree(numeric_data, numeric_classifiers, "class")  #from 2013 Q2

You can also change the <tt>classification_key</tt> to predict a different attribute. For example, you can use the other features (including <tt>tree_type</tt>) to predict what <tt>bark_texture</tt> a tree has:

 print construct_greedy_id_tree(tree_data, tree_classifiers_reverse, "bark_texture")  #build an ID tree to predict bark_texture

=== Multiple Choice ===

==== Questions 1-4: Identification (of) Trees ====

These questions refer to the data from 2014 Quiz 2, Problem 1, Part A, which is stored in the variables <tt>tree_data</tt> (the data) and <tt>tree_classifiers</tt> (a list of Classifiers).

'''Question 1:''' Which of the four classifiers (a.k.a. feature tests) has the lowest disorder?  Fill in ANSWER_1 with one of the four classifiers as a string: 'has_leaves', 'leaf_shape', 'orange_foliage', or 'tree_type'.  <!--If multiple classifiers are tied for having the lowest disorder, instead fill in ANSWER_1 with a list of classifier names.-->

'''Question 2:''' Which of the four classifiers has the ''second'' lowest disorder?  Fill in ANSWER_2 with one of the four classifiers as a string.

'''Question 3:''' If we start constructing the greedy, disorder-minimizing ID tree, we'll start with the classifier from Question 1.  Our one-classifier tree has exactly one non-homogeneous branch.  For that branch, which of the four classifiers has the lowest disorder?  Fill in ANSWER_3 with one of the four classifiers as a string.

'''Question 4:''' Is the test with the ''second'' lowest disorder in the first round always the same as the test with the lowest disorder in the second round?  Answer 'Yes' or 'No' in ANSWER_4.

==== Questions 5-9: XOR ====

Each training point in the dataset below has three binary features (A, B, C) and a binary classification (0 or 1).  Note that the classification happens to be the boolean function XOR(A,B), while feature C is just random noise.  The dataset is available in the lab as <tt>binary_data</tt>, and the three classifiers are stored in the variable <tt>binary_classifiers</tt>.

{| cellpadding=5 border=1 cellspacing=0 style="text-align: center;"
|- bgcolor=#eeeeee
! Classification !! A !! B !! C
|-
| 0 || 0 || 0 || 0
|-
| 0 || 0 || 0 || 1
|-
| 1 || 0 || 1 || 0
|-
| 1 || 0 || 1 || 1
|-
| 1 || 1 || 0 || 1
|-
| 1 || 1 || 0 || 1
|-
| 0 || 1 || 1 || 0
|}

Consider the three identification trees below.  They are defined in the lab code as <tt>binary_tree_n</tt>, for <tt>n</tt> from 1 to 3.  To answer the questions below, you may use your code or work out the answers by hand.

[[Image:Lab5_binary_trees.png]]

'''Question 5''': Which of the trees correctly classify all of the training data?  Fill in ANSWER_5 with a list of numbers (e.g. [1,2] if trees 1 and 2 correctly classify the training data but tree 3 does not).

'''Question 6''': Which of the trees could be created by the greedy, disorder-minimizing ID tree algorithm?  Fill in ANSWER_6 with a list of numbers.

'''Question 7''': Which trees correctly compute the function XOR(A,B)?  Fill in ANSWER_7 with a list of numbers.

'''Question 8''': Based on Occam's Razor, which tree (among the ones that correctly classify the data) is likely to give the most accurate results when classifying unknown test points?  Fill in ANSWER_8 with a single number, as an int.

'''Question 9''': Will the greedy, disorder-minimizing tree always be the simplest possible tree to correctly classify the data?  Answer 'Yes' or 'No' in ANSWER_9.

== k-Nearest Neighbors ==

=== Multiple Choice: Drawing boundaries ===

Below are images of 3 different data sets in Cartesian space, with numerous ways of drawing decision boundaries for each.  For each image, answer the following question by filling in ANSWER_n with an int 1, 2, 3, or 4:

:Which method could create the decision boundaries drawn in the image?
:#Greedy, disorder-minimizing ID tree, using only tests of the form "X > threshold" or "Y > threshold"
:#1-nearest neighbors using Euclidean distance
:#Both
:#Neither

[[Image:Lab5_decision_boundary_images.jpg]]

=== Python warm-up: Distance metrics ===

We can use many different distance metrics with k-nearest neighbors.  In 6.034, we cover four such metrics:

* '''Euclidean distance''': the straight-line distance between two points: the square root of the sum of the squared differences of corresponding coordinates, sqrt((u<sub>1</sub>-v<sub>1</sub>)<sup>2</sup> + ... + (u<sub>n</sub>-v<sub>n</sub>)<sup>2</sup>).
* '''Manhattan distance''': the distance between two points assuming you can only move parallel to the axes: the sum of the absolute differences of corresponding coordinates, |u<sub>1</sub>-v<sub>1</sub>| + ... + |u<sub>n</sub>-v<sub>n</sub>|. In two dimensions, you can conceptualize this as only being able to move along the "grid lines."
* '''Hamming distance''': the number of differing corresponding components between two vectors, i.e. the number of coordinate positions at which the two vectors differ.
* '''Cosine distance''': compares the angle between two vectors, where each point is represented as a vector from the origin to the point's coordinates.  Cosine distance is typically a similarity measure, not a difference measure, because cos(0)=1 means "perfectly similar", cos(90 deg)=0 means "perfectly orthogonal", and cos(180 deg)=-1 means "maximally dissimilar" (or "perfectly opposite").  For our code, however, we need cosine to be a difference measure so that smaller distance means "more similar", to match all the other distance metrics.  Accordingly, we will define cosine distance to be 1 minus the cosine of the angle between two vectors, so that cosine distance varies between 0 (perfectly similar) and 2 (maximally dissimilar): cosine_distance(u, v) = 1 - dot_product(u, v) / (norm(u) * norm(v)).

It's also possible to transform data instead of using a different metric -- for instance, transforming to polar coordinates -- but in this lab we will only work with distance metrics in Cartesian coordinates.

We'll start with some basic functions for manipulating vectors, which may be useful for <tt>cosine_distance</tt>.  For these, we will represent an ''n''-dimensional vector as a list or tuple of ''n'' coordinates.  <tt>dot_product</tt> should compute the dot product of two vectors, while <tt>norm</tt> computes the length of a vector.  (There is a simple implementation of <tt>norm</tt> that uses <tt>dot_product</tt>.)  Implement both functions:

 def dot_product(u, v):
def norm(v):
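
Possible sketches (one reasonable implementation of each):

 def dot_product(u, v):
     # Sum of the products of corresponding coordinates.
     return sum(u_i * v_i for u_i, v_i in zip(u, v))

 def norm(v):
     # The length of a vector is the square root of its dot product with itself.
     return dot_product(v, v) ** 0.5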

Next, you'll implement each of the four distance metrics.  For the rest of the k-nearest neighbors portion of this lab, we will represent data points as [[#Point_(for_kNN)|Point]] objects (described below in the API).  Each distance function should take in two Points and return the distance between them as a float or int.  Implement each distance function:

 def euclidean_distance(point1, point2):

 def manhattan_distance(point1, point2):

 def hamming_distance(point1, point2):

def cosine_distance(point1, point2):
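
Possible sketches for the four metrics, assuming <tt>Point</tt> objects can be iterated over to get their coordinates (as noted in the Point API); these are illustrations, not the only correct implementations:

 def euclidean_distance(point1, point2):
     # Straight-line distance: square root of the sum of squared coordinate differences.
     return sum((a - b) ** 2 for a, b in zip(point1, point2)) ** 0.5

 def manhattan_distance(point1, point2):
     # Sum of absolute coordinate differences ("grid line" distance).
     return sum(abs(a - b) for a, b in zip(point1, point2))

 def hamming_distance(point1, point2):
     # Number of coordinates in which the two points differ.
     return sum(1 for a, b in zip(point1, point2) if a != b)

 def cosine_distance(point1, point2):
     # 1 minus the cosine of the angle between the two coordinate vectors.
     u, v = list(point1), list(point2)
     return 1 - dot_product(u, v) / (norm(u) * norm(v))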

=== Classifying Points with k-Nearest Neighbors ===

Now that we have defined four distance metrics, we will use them to classify points using k-nearest neighbors with various values of k.

First, write a function <tt>get_k_closest_points</tt> that takes in a point (a Point object), some data (a list of Point objects), a number k, and a distance metric (a function, such as euclidean_distance), and returns the k points in the data that are nearest to the input point, according to the distance metric.  Break ties lexicographically using the points' coordinates (i.e. in case of a tie, sort by point.coords).  This function can be written cleanly in 5 lines or less.

def get_k_closest_points(point, data, k, distance_metric):
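
One possible sketch (illustrative only):

 def get_k_closest_points(point, data, k, distance_metric):
     # Sort by (distance, coordinates) so that equidistant points are ordered
     # lexicographically by their coordinates, then keep the first k.
     return sorted(data, key=lambda p: (distance_metric(point, p), p.coords))[:k]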

'''A note about ties:''' There are two ways to get ties when classifying points using k-nearest neighbors.

#If two or more training points are equidistant to the test point, the "k nearest" points may be undefined.  If the equidistant points have the same classification, it doesn't make a difference, but in some cases they could cause the classification to be ambiguous.  In this lab, these cases are handled in <tt>get_k_closest_points</tt> by breaking ties lexicographically.
#If the two most likely classifications are equally represented among the k nearest training points, the classification is ambiguous.  In other words, the k-nearest neighbors are voting on a classification, and if the vote is tied, there needs to be some mechanism for breaking the tie.  In this lab, we will assume that there are no voting ties, so you don't need to worry about tie-breaking.  (All of the test cases in this section are designed to ensure that there will not be any voting ties.)

Your task is to use <tt>get_k_closest_points</tt> to write a function that classifies a point, given a set of data (as a list of Points), a value of k, and a distance metric (a function):

 def knn_classify_point(point, data, k, distance_metric):

Python hint: To get the mode of a list, there's a neat one-line trick using <tt>max</tt> with the <tt>key</tt> argument, the <tt>list.count</tt> function, and converting a list to a set to get the unique elements. <!--max(set(my_list), key=my_list.count)-->
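
For reference, a sketch combining the nearest-neighbors lookup with the voting trick described above (illustrative only):

 def knn_classify_point(point, data, k, distance_metric):
     # Find the k nearest neighbors, then let them vote on a classification.
     neighbors = get_k_closest_points(point, data, k, distance_metric)
     votes = [neighbor.classification for neighbor in neighbors]
     # Mode of the votes (voting ties are assumed not to occur in this lab).
     return max(set(votes), key=votes.count)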

=== Cross-validation: Choosing the best k and distance metric ===

There are many possible implementations of cross validation, but the general idea is to separate your data into a training set and a test set, then use the test set to determine how well the training parameters worked.  For k-nearest neighbors, this means choosing a value of k and a distance metric, then testing each point in the test set to see whether it is classified correctly.  The "cross" part of cross-validation comes from the idea that you can re-separate your data multiple times, so that different subsets of the data take turns being in the training and test sets.

For the next function, you'll implement a specific type of cross-validation known as leave-one-out cross-validation.  If there are n points in the data set, you'll separate the data n times into a training set of size n-1 (all the points except one) and a test set of size one (the one point being left out).  Implement the function <tt>cross_validate</tt>, which takes in a set of data, a value of k, and a distance metric, and returns the fraction of points that were classified correctly using leave-one-out cross-validation.

def cross_validate(data, k, distance_metric):
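
A sketch of leave-one-out cross-validation (one reasonable implementation):

 def cross_validate(data, k, distance_metric):
     # Each point takes a turn as the single-point test set.
     num_correct = 0
     for i in range(len(data)):
         test_point = data[i]
         training_data = data[:i] + data[i+1:]
         if knn_classify_point(test_point, training_data, k, distance_metric) == test_point.classification:
             num_correct += 1
     return num_correct / float(len(data))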

Use your cross-validation method to write a function that takes in some data (a list of Points) and tries every combination of reasonable values of k and the four distance metrics we defined above, then returns a tuple <tt>(k, distance_metric)</tt> representing the value of k and the distance metric that produce the best results with cross validation:

def find_best_k_and_metric(data):
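
One possible sketch. The range of k values tried below is an assumption (the text above only says "reasonable values of k"); adjust it as needed:

 def find_best_k_and_metric(data):
     metrics = [euclidean_distance, manhattan_distance, hamming_distance, cosine_distance]
     best_score, best_combo = -1, None
     # Try every combination of k and distance metric, keeping the combination
     # with the best leave-one-out cross-validation score.
     for k in range(1, len(data)):   # assumed "reasonable" range of k values
         for metric in metrics:
             score = cross_validate(data, k, metric)
             if score > best_score:
                 best_score, best_combo = score, (k, metric)
     return best_combo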

== Survey ==

Please answer these questions at the bottom of your <tt>lab6.py</tt> file:

* <tt>NAME</tt>: What is your name? (string)

* <tt>COLLABORATORS</tt>: Other than 6.034 staff, with whom did you work on this lab? (string, or empty string if you worked alone)

* <tt>HOW_MANY_HOURS_THIS_LAB_TOOK</tt>: Approximately how many hours did you spend on this lab? (number or string)

* <tt>WHAT_I_FOUND_INTERESTING</tt>: Which parts of this lab, if any, did you find interesting? (string)

* <tt>WHAT_I_FOUND_BORING</tt>: Which parts of this lab, if any, did you find boring or tedious? (string)

* (optional) <tt>SUGGESTIONS</tt>: What specific changes would you recommend, if any, to improve this lab for future years? (string)

(We'd ask which parts you find confusing, but if you're confused you should really ask a TA.)

When you're done, run the online tester to submit your code.

== API Reference Documentation ==

The file <tt>api.py</tt> defines the <tt>Classifier</tt>, <tt>IdentificationTreeNode</tt>, and <tt>Point</tt> classes, as well as some helper functions for ID trees, all described below.

=== Classifier ===

<tt>Classifier</tt> objects are used for constructing and manipulating ID trees.

A <tt>Classifier</tt> has the following attributes:

*<b><tt>classifier</tt></b><tt>.name</tt>, the name of the classifier.
*<b><tt>classifier</tt></b><tt>.classify</tt>, a function that takes in a point and returns a value.

In our ID trees, a point is represented as a dict mapping attribute names to their values.

=== IdentificationTreeNode ===

In this lab, an ID tree is represented ''recursively'' as a tree of <tt>IdentificationTreeNode</tt> objects. In particular, an <tt>IdentificationTreeNode</tt> object fully represents an entire ID tree rooted at that node.

For example, suppose we have an <tt>IdentificationTreeNode</tt> called <tt>id_tree_node</tt>. Then, <tt>id_tree_node</tt>'s children are themselves <tt>IdentificationTreeNode</tt> objects, each fully describing the sub-trees of <tt>id_tree_node</tt>. However, if <tt>id_tree_node</tt> has no children, then <tt>id_tree_node</tt> is a leaf, meaning that it represents a homogeneous (by classification) set of data points. Furthermore, any datum at this node is definitively classified by that leaf's classification.

As such, in a completed ID tree, each node is either

* a leaf node with a ''classification'' such as <tt>"Vampire"</tt> or <tt>"Not Vampire"</tt> in the vampires example; or
* a non-leaf node with a ''classifier'' (such as <tt>"Accent"</tt> in the vampires example), with branches (one per classifier result, e.g. <tt>"Heavy"</tt>, <tt>"Odd"</tt>, and <tt>"None"</tt>) leading to child nodes.

An <tt>IdentificationTreeNode</tt> has the following attributes and methods:

*<b><tt>id_tree_node</tt></b><tt>.data</tt>, the list of training points at this node in the tree.
*<b><tt>id_tree_node</tt></b><tt>.classification_key</tt>, the single attribute by which the tree classifies points (e.g. <tt>"Vampire?"</tt> for the vampires example, or <tt>"Classification"</tt> for the angel data).
*<b><tt>id_tree_node</tt></b><tt>.get_parent_branch_name()</tt>: returns the name of the branch leading to this node, or <tt>None</tt> if this is a root node.
*<b><tt>id_tree_node</tt></b><tt>.is_leaf()</tt>: returns <tt>True</tt> if the node is a leaf (has a classification), otherwise <tt>False</tt>.
*<b><tt>id_tree_node</tt></b><tt>.set_classifier(classifier)</tt>: Uses the specified <tt>[[#Classifier | Classifier]]</tt> object to add branches below the current node.  Modifies and returns the original <tt>id_tree_node</tt>.  May print warnings if the specified classifier is inadvisable.
*<b><tt>id_tree_node</tt></b><tt>.apply_classifier(point)</tt>: Applies this node's classifier to a given data point by following the appropriate branch of the tree, then returns the ''child'' node.  If this node is a leaf node (and thus doesn't have a classifier), raises a <tt>ClassifierError</tt>.
*<b><tt>id_tree_node</tt></b><tt>.get_classifier()</tt>: returns the [[#Classifier | classifier]] associated with this node, if it has one. Otherwise, returns <tt>None</tt>.
*<b><tt>id_tree_node</tt></b><tt>.set_node_classification(classification)</tt>: Sets this node's classification, thus defining this node as a leaf node.  Modifies and returns the original <tt>id_tree_node</tt>.  May print warnings if the node already has branches defined.
*<b><tt>id_tree_node</tt></b><tt>.get_node_classification()</tt>: returns this node's classification, if it has one. Otherwise, returns <tt>None</tt>.
*<b><tt>id_tree_node</tt></b><tt>.get_branches()</tt>: returns a dictionary mapping this node's branch names to its child nodes (e.g. <tt>{'Ruddy': <IdentificationTreeNode object>, 'Normal': <IdentificationTreeNode object>, 'Pale': <IdentificationTreeNode object>}</tt>).
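
For instance, here is a small sketch that walks a finished tree and prints its structure using only the methods above (illustrative only; <tt>print_id_tree</tt> is not part of the lab API):

 def print_id_tree(node, depth=0):
     indent = "    " * depth
     if node.is_leaf():
         # Leaf node: print its classification.
         print(indent + str(node.get_node_classification()))
     else:
         # Non-leaf node: print the classifier's name, then each branch.
         print(indent + node.get_classifier().name)
         for branch_name, child in node.get_branches().items():
             print(indent + "  " + branch_name + ":")
             print_id_tree(child, depth + 1)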

=== Helper Functions for ID Trees ===

* <tt>is_homogeneous(data, classification_key)</tt>: given a list of data points and a classification key, returns <tt>True</tt> if the data is homogeneous with respect to the attribute named by <tt>classification_key</tt>, otherwise <tt>False</tt>.

=== Point (for kNN) ===

<tt>Point</tt> objects are used in the k-nearest neighbors section only.

The <tt>Point</tt> class supports iteration (to iterate through the coordinates).

A <tt>Point</tt> has the following attributes:

*<b><tt>point</tt></b><tt>.name</tt>, the name of the point (a string), if defined.
*<b><tt>point</tt></b><tt>.coords</tt>, the coordinates of the point, represented as a vector (a tuple or list of numbers).
*<b><tt>point</tt></b><tt>.classification</tt>, the classification of the point, if known.

== Appendix ==

=== How to handle exceptions ===

The following commands show how to raise and catch an exception, using <tt>NoGoodClassifiersError</tt> as an example:

{| cellpadding=5 border=1 cellspacing=0
|- align=left bgcolor=#eeeeee
! Python command !! Effect
|-
| <tt>raise NoGoodClassifiersError("message")</tt> || Stops execution by raising, or throwing, a NoGoodClassifiersError with the specified error message
|-
| <tt>try:<br>&nbsp;&nbsp;&nbsp;&nbsp;#code here</tt> || Defines the beginning of a block of code that might raise an exception
|-
| <tt>except NoGoodClassifiersError:<br>&nbsp;&nbsp;&nbsp;&nbsp;#code here</tt> || Follows a <tt>try</tt> block.  Defines what to do if a NoGoodClassifiersError was raised in the <tt>try</tt> block.
|}
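
For example, a caller might handle the exception like this (an illustrative sketch):

 try:
     best = find_best_classifier(data, possible_classifiers, target_classifier)
     id_tree_node.set_classifier(best)
 except NoGoodClassifiersError:
     # No classifier separates the data; leave the node's classification unassigned.
     pass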