Big-O Notation and Feature Selection

Fri 13 March 2015
by Jesse Lumarie

Some problems are computationally intractable-- while we know how to program a computer to solve a problem, the computer (or array of computers) would take so much time¹ that calculating the answer wouldn't be worth it. One of these intractable problems is finding all subsets of a large group, as I recently found out when I completed the intro to machine learning course offered by Udacity.

For the final project, students attempted to predict whether an individual was a person of interest in the Enron scandal based on characteristics of the Enron email corpus (a giant file containing about 600,000 emails sent by Enron employees), as well as basic employee characteristics (or features), shown below:

financial features: ['salary', 'deferral_payments', 'total_payments', 'loan_advances','bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees']

email features: ['to_messages', 'email_address', 'from_poi_to_this_person', 'from_messages','from_this_person_to_poi', 'poi', 'shared_receipt_with_poi']

These features correspond with a flag (or label) which indicates whether that employee was a person of interest (POI). During the course, we used decision tree² and random forest classifying algorithms for many of our classification problems, so I decided to use these algorithms for the final project.

To keep the model simple, I only used the employee features (leaving the Enron corpus for another day). Rather than throwing all of the features into an decision tree or random forests implementation (which in this case leads to poor classifier performance), I tried to select the features that yielded the highest combination of recall and precision.

In machine learning, feature selection is self-explanatory-- it's the process of selecting features for your model. What isn't clear, however, is the best method of feature selection. As someone pretty new to the machine learning, I chose a method which seemed pretty clever at the time, but turned out to be only somewhat useful and not at all scalable.

Twenty features didn't seem like a very large number, small enough that I could iterate through every subset and find the best combination of features. Subsets represent the different ways you can choose items of a particular set³. For example, there are four subsets in the set {A, B}:

{ } choosing nothing at all⁴
{ A } just choosing A
{ B } just choosing B
{ A, B } choosing both A & B

Figuring out the number of subsets from a set of two items is easy enough-- computing all possible combinations of a set of twenty features is not, and that leads me to Big-O notation.

Big-O notation is a way of classifying an algorithm's processing time as the input for that algorithm increases in size. Say you had an algorithm that took a list of names, and sorted them in alphabetical order. All algorithms take time to run⁵. The results of running this name-sorting algorithm on an increasing numbers of names would look something like this:

Number of Names	Time to Compute
1	1 seconds
10	100 seconds
100	10,000 seconds

The time it takes the sorting algorithm to run is the square of the number of name inputs. In Big-O notation, the algorithm would be said to run in O(n²), or quadratic time. There are a number of common algorithm complexities, the most simple being constant time, often written as O(1). An example of constant time would be accessing a single item of an array using its index value (this is because the computer only needs a single instruction to execute this operation, regardless of the instruction's "size")⁶. Finding all subsets of a given set, unfortunately, falls on the opposite end of the time-complexity spectrum.

What happens if you try to compute the number subsets at the same rate as we did for sorting names?

Number of Items in a Set	Number of Subsets
1	2 subsets
10	1,024 subsets
100	1.26 x 10^30 subsets

Yikes⁷. Finding the number of subsets of a given set ends up returning very large numbers very fast, at an exponential rate in fact. This type of complexity is known unceremoniously as exponential time, and is written as O(2ⁿ).

You'll note that I didn't use a time unit for the subset. Even if the algorithm took no time to compute, it would still need to return the resulting subsets somehow, which in itself would take time. In the impossible situation of a algorithm with zero time cost, an algorithm with O(2ⁿ) is nonetheless computationally intractable, as the number of items returned is so big that it would take an exponentially longer time run on larger sets.

Let's bring it back to feature selection. While I wanted sort through all twenty features and choose the best combination, this just wasn't possible given the computational complexity of finding subsets. So what did I do?

I used a combination of intuition and testing to get the universe of features down to a reasonable number (twelve, which still took awhile to run), and the ran the subset algorithm from there. I chose features by measuring the recall and precision that resulted after removing features which appeared unhelpful on their face, as well as by looking at the 'feature_importances_' attribute of sklearn's implementation of random forests.

While this is not ideal ⁸, I ended up with some okay results⁹ and a practical lesson in alogrithm complexity and Big-O Notation: respect the exponent and its power(s).

Certain types of encryption fall into this category. In the case of AES-256 encryption, too long is greater than the age of the universe. ↩
Every time I think of decision trees, I think of these guys. ↩
Check out mathisfun.com for more information about sets and subsets. ↩
Called an 'empty set'. ↩
Citation needed. ↩
Wikipedia offers a pretty great explanation of the various time complexities. ↩
That escalated ~~quickly~~ exponentially. ↩
Since completing the project, I learned that sklearn actually offers tools that can be used to pare down features recursively-- perhaps I'll explore those in another post. ↩
Precision: 0.59184 Recall: 0.52200 (we were shooting for better than .30 for both) ↩