Lab 7
From 6.034 Wiki
Contents |
This lab is due by Friday, December 4 at 10:00pm.
The online tester is now available!
To work on this lab, you will need to get the code:
- You can view the files at: http://web.mit.edu/6.034/www/labs/lab7/
- Download it as a ZIP file: http://web.mit.edu/6.034/www/labs/lab7/lab7.zip
- Or, on Athena, add 6.034 and copy it from /mit/6.034/www/labs/lab7/.
Your answers for this lab belong in the main file lab7.py.
Problems: Adaboost
In this lab, you will code the Adaboost algorithm to perform boosting.
Throughout this lab, we will assume that there are exactly two classifications, meaning that every not-misclassified training point is classified correctly.
Helper functions
Initialize weights
First, implement initialize_weights to assign every training point a weight equal to 1/N, where N is the number of training points. This function takes in a list of training points, where each point is represented as a string. The function should return a dictionary mapping points to weights.
def initialize_weights(training_points):
Calculate error rates
Next, we want to calculate the error rate ε of each classifier. The error rate for a classifier h is the sum of the weights of the training points that h misclassifies.
calculate_error_rates takes in two dictionaries:
- point_to_weight: maps each training point (represented as a string) to its weight (a number).
- classifier_to_misclassified: maps each classifier to a list of the training points that it misclassifies. For example, this dictionary may contain entries such as "classifier_0": ["point_A", "point_C"].
Implement calculate_error_rates to return a dictionary mapping each weak classifier (a string) to its error rate (a number):
def calculate_error_rates(point_to_weight, classifier_to_misclassified):
Pick the best weak classifier
Once we have calculated the error rate of each weak classifier, we need to select the "best" weak classifier. "Best" has two possible definitions:
- "smallest error rate" (use_smallest_error=True)
- "error rate furthest from 1/2" (use_smallest_error=False)
Implement pick_best_classifier to return the name of the "best" weak classifier:
def pick_best_classifier(classifier_to_error_rate, use_smallest_error=True):
There are a few details that can make this function tricky:
- Tie-breaking: In case two or more weak classifiers have equally good error rates, return the one that comes first alphabetically. (Hint: The built-in function sorted may help.)
- Roundoff error: Python sometimes rounds numbers, especially when adding them to dictionaries, which can result in numbers being off by 0.0000000000000001 (10^-16) when they should be equal. Accordingly, you may want to use the helper function fix_roundoff_error after performing float arithmetic, but before comparing equality or passing numbers into min or max. The function fix_roundoff_error takes in a number and corrects it for roundoff error at the 10^-16 place. It can also take in a list of numbers or a dictionary whose values are numbers, then return a new corrected list or dictionary. (Caveat: Try to only use fix_roundoff_error for comparing numbers, as it can apparently introduce a different type of roundoff error if you return floats that have been "fixed".)
- Using min or max on a dictionary: This is something you may want to do, depending on your implementation. It may be helpful to note that min and max can take an optional key argument. (It works the same way as the key argument for sorted, which you may have used in Lab 2.)
Calculate voting power
After selecting the best weak classifier, we'll need to compute its voting power. If ε is the error rate of the weak classifier, then its voting power is:
1/2 * ln((1-ε)/ε)
Implement calculate_voting_power to compute a classifier's voting power, given its error rate 0 ≤ ε ≤ 1. (For your convenience, the constant INF and the natural log function ln have been defined in lab7.py.)
def calculate_voting_power(error_rate):
Hint: What voting power would you give to a weak classifier that classifies all the training points correctly? What if it misclassifies all the training points?
Is H good enough?
One of the three exit conditions for Adaboost is when the overall classifier H is "good enough" -- that is, when H classifies enough training points correctly. The function is_good_enough takes in four arguments:
- H: An overall classifier, represented as a list of (classifier, voting_power) tuples. (Recall that each classifier is represented as a string.)
- training_points: A list of all training points. (Recall that each training point is represented as a string.)
- classifier_to_misclassified: A dictionary mapping each classifier to a list of the training points that it misclassifies.
- mistake_tolerance: The maximum number of points that H is allowed to misclassify.
is_good_enough should return True if H is good enough (does not misclassify too many points), otherwise False. Each training point's classification is determined by a weighted vote of the weak classifiers in H. (If the vote is a tie, consider the training point to be misclassified.)
Implement:
def is_good_enough(H, training_points, classifier_to_misclassified, mistake_tolerance=0):
Hint: You don't need to know the the actual classification of a training point. Because there are only two classifications, knowing whether the training point was misclassified should be sufficient.
Update weights
After each round, Adaboost updates weights in preparation for the next round. The updated weight for each point depends on whether the point was classified correctly or incorrectly by the current weak classifier:
- If the point was classified correctly: new weight = 1/2 * 1/(1-ε) * (old weight)
- If the point was misclassified: new weight = 1/2 * 1/ε * (old weight)
The function update_weights takes in 3 arguments:
- point_to_weight: A dictionary mapping each point to its current weight. You are allowed to modify this dictionary, but are not required to.
- misclassified_points: A list of points misclassified by the current weak classifier.
- error_rate: The error rate ε of the current weak classifier.
Implement update_weights to return a dictionary mapping each point to its new weight:
def update_weights(point_to_weight, misclassified_points, error_rate):
Note: This function should not require fix_roundoff_error. (In fact, in some cases, using fix_roundoff_error in update_weights has been observed to result in accumulated errors later on.)
Adaboost
Using all the helper functions you've written above, implement the Adaboost algorithm:
def adaboost(training_points, classifier_to_misclassified, use_smallest_error=True, mistake_tolerance=0, max_num_rounds=INF):
Keep in mind that Adaboost has three exit conditions:
- If H is "good enough" (see Is H good enough?, above)
- If it has completed the maximum number of rounds
- If the best weak classifier is no good (has an error rate ε = 1/2)
Adaboost is initialized by setting the weight of every training point to 1/N (where N is the number of training points). Then, Adaboost repeats the following steps until one of the three exit conditions is satisfied.
- Compute the error rate of each weak classifier.
- Pick the best weak classifier h. Note that "best" can mean either "smallest error rate" (use_smallest_error=True) or "error rate furthest from 1/2" (use_smallest_error=False).
- Compute the voting power of the weak classifier h, using its error rate ε.
- Append h to the overall classifier H, as a tuple (h, voting_power).
- Update weights in preparation for the next round.
The function adaboost takes in five arguments:
- training_points: A list of all training points.
- classifier_to_misclassified: A dictionary mapping each classifier to a list of the training points that it misclassifies.
- use_smallest_error: A boolean value indicating which definition of "best" to use: "smallest error rate" or "error rate furthest from 1/2".
- mistake_tolerance: The maximum number of points that H is allowed to misclassify.
- max_num_rounds: The maximum number of rounds of boosting to perform before returning H. (This is equivalent to the maximum number of tuples that can be added to H.)
adaboost should return the overall classifier H, represented as a list of (classifier, voting_power) tuples.
Hints:
- If adaboost fails to exit with the ε=1/2 condition because Python thinks 0.5 != 0.5, you can use fix_roundoff_error to correct the error rate before comparing it to 0.5. (fix_roundoff_error can take in a number, list, or dict.)
- Note that weight updates are based on on which points were misclassified by the current weak classifier, not which points were misclassified by the overall classifier H.
Survey
Please answer these questions at the bottom of your lab6.py file:
- NAME: What is your name? (string)
- COLLABORATORS: Other than 6.034 staff, whom did you work with on this lab? (string, or empty string if you worked alone)
- HOW_MANY_HOURS_THIS_LAB_TOOK: Approximately how many hours did you spend on this lab? (number or string)
- WHAT_I_FOUND_INTERESTING: Which parts of this lab, if any, did you find interesting? (string)
- WHAT_I_FOUND_BORING: Which parts of this lab, if any, did you find boring or tedious? (string)
- (optional) SUGGESTIONS: What specific changes would you recommend, if any, to improve this lab for future years? (string)
(We'd ask which parts you find confusing, but if you're confused you should really ask a TA.)
When you're done, run the online tester to submit your code.