I got interested in the following question lately: given a data set of examples with some continuous-valued features and discrete classes, what’s a good way to reduce the continuous features into a set of discrete values?

What makes this question interesting? One very specific reason is that some machine learning algorithms, like Decision Trees, require discrete features. As a result, potentially informative data has to be discarded. For example, consider the Titanic dataset: we know the age of passengers of the Titanic, or how much they paid for their ticket. To use these features, we would need to reduce them to a set of states, like “Old/Young” or “Cheap/Medium/Expensive” – but how can we determine what states are appropriate, and what values separate them?

More generally, it’s easier to reason about a handful of cases than a continuous variable – and it’s also more convenient computationally to represent information as a finite set states.

So how could we go about identifying a reasonable way to partition a continuous variable into a handful of informative, representative states?

In the context of a classification problem, what we are interested in is whether the states provide information with respect to the Classes we are trying to recognize. As far as I can tell from my cursory review of what’s out there, the main approaches use either Chi-Square tests or Entropy to achieve that goal. I’ll leave aside Chi-Square based approaches for today, and look into the Recursive Minimal Entropy Partitioning algorithm proposed by Fayyad & Irani in 1993.

## The algorithm idea

The algorithm hinges on two key ideas:

- Data should be split into intervals that maximize the information, measured by Entropy,
- Partitioning should not be too fine-grained, to avoid over-fitting.

The first part is classic: given a data set, split in two halves, based on whether the continuous value is above or below the “splitting value”, and compute the gain in entropy. Out of all possibly splitting values, take the one that generates the best gain – and repeat in a recursive fashion.

Let’s illustrate on an artificial example – our output can take 2 values, Yes or No, and we have one continuous-valued feature:

Continuous Feature | Output Class |

1.0 | Yes |

1.0 | Yes |

2.0 | No |

3.0 | Yes |

3.0 | No |

As is, the dataset has an Entropy of H = - 0.6 x Log (0.6) – 0.4 x Log (0.4) = 0.67 (5 examples, with 3/5 Yes, and 2/5 No).

The Continuous Feature takes 3 values: 1.0, 2.0 and 3.0, which leaves us with 2 possible splits: strictly less than 2, or strictly less than 3. Suppose we split on 2.0 – we would get 2 groups. Group 1 contains Examples where the Feature is less than 2:

Continuous Feature | Output Class |

1.0 | Yes |

1.0 | Yes |

The Entropy of Group 1 is H(g1) = - 1.0 x Log(1.0) = 0.0

Group 2 contains the rest of the examples:

Continuous Feature | Output Class |

2.0 | No |

3.0 | Yes |

3.0 | No |

The Entropy of Group 2 is H(g2) = - 0.33 x Log(0.33) – 0.66 x Log(0.66) = 0.63

Partitioning on 2.0 gives us a gain of H – 2/5 x H(g1) – 3/5 x H(g2) = 0.67 – 0.4 x 0.0 – 0.6 x 0.63 = 0.04. That split gives us additional information on the output, which seems intuitively correct, as one of the groups is now formed purely of “Yes”. In a similar fashion, we can compute the information gain of splitting around the other possible value, 3.0, which would give us a gain of 0.67 – 0.6 x 0.63 – 0.4 x 0.69 = - 0.00: that split doesn’t improve information, so we would use the first split (or, if we had multiple splits with positive gain, we would take the split leading to the largest gain).

So why not just recursively apply that procedure, and split our dataset until we cannot achieve information gain by splitting further? The issue is that we might end up with an artificially fine-grained partition, over-fitting the data.

More...

## Comments

- It takes one to know one (3)
- Create optimization programs dynamically with C# and the Microsoft Solver Foundation (8)
- Management, the Gordon Ramsay way (4)
- VSTO add-in with multiple assemblies (5)

Comment RSSRon wrote: Statistics are used much like a drunk uses a lampp... [More]

Ron wrote: I just started tinkering with Solver the other day... [More]

bryan tyler nelson wrote: I do not even understand how I stopped up right he... [More]

prestamo wrote: What i do not realize is in truth how you are now ... [More]