Many Categories Impurity

Learn how CART decision trees handle categorical features with more than two categories.

Multivalue attributes

When building decision trees, the CART algorithm uses only two-way (i.e., binary) splits of the data. CART classification trees choose among candidate splits using the Gini gain calculation. This lesson builds on that knowledge by showing how the CART classification tree algorithm handles a situation that is widespread in business data: categorical features with more than two values.
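As a quick refresher, the Gini gain of a binary split can be sketched as the parent node's Gini impurity minus the size-weighted impurity of the two child nodes. The function names and the toy labels below are illustrative assumptions, not part of the lesson's own code.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Hypothetical labels, made up purely for illustration
parent = ["yes", "yes", "no", "no"]

# A split that separates the classes completely removes all impurity
perfect = gini_gain(parent, ["yes", "yes"], ["no", "no"])

# A split whose children mirror the parent removes none
useless = gini_gain(parent, ["yes", "no"], ["yes", "no"])

print(perfect, useless)
```

A larger gain means the two child nodes are purer than the parent, which is why the split with the highest gain is preferred.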

Consider the following Adult Census Income data sample:

Adult Census Income Data Sample

Occupation          Income
Adm-clerical        <=50K
Exec-managerial     <=50K
Handlers-cleaners   <=50K
Handlers-cleaners   <=50K
Prof-specialty      <=50K
Exec-managerial     <=50K
Other-service       <=50K
Exec-managerial     >50K
Prof-specialty      >50K
Exec-managerial     >50K

In this data sample, the Occupation feature has five distinct values (known as levels in R) ...
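Because CART only makes two-way splits, a feature with five levels has to be divided into two groups of levels. One common approach, assumed here, is to consider every distinct grouping (there are 2^(k-1) - 1 of them for k levels) and keep the one with the highest Gini gain. The sketch below applies that idea to the ten rows of the data sample; the variable and function names are hypothetical, and real implementations may use shortcuts rather than this exhaustive search.

```python
from collections import Counter
from itertools import combinations

# The ten rows of the data sample, as (occupation, income) pairs
rows = [
    ("Adm-clerical", "<=50K"),
    ("Exec-managerial", "<=50K"),
    ("Handlers-cleaners", "<=50K"),
    ("Handlers-cleaners", "<=50K"),
    ("Prof-specialty", "<=50K"),
    ("Exec-managerial", "<=50K"),
    ("Other-service", "<=50K"),
    ("Exec-managerial", ">50K"),
    ("Prof-specialty", ">50K"),
    ("Exec-managerial", ">50K"),
]


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(left_levels):
    """Gini gain of sending `left_levels` one way and all other levels the other."""
    parent = [income for _, income in rows]
    left = [income for occ, income in rows if occ in left_levels]
    right = [income for occ, income in rows if occ not in left_levels]
    n = len(parent)
    return (
        gini_impurity(parent)
        - (len(left) / n) * gini_impurity(left)
        - (len(right) / n) * gini_impurity(right)
    )


# The distinct occupation values ("levels")
levels = sorted({occ for occ, _ in rows})

# Pin the first level to the left group so each two-way grouping is listed once.
# `size` is how many other levels join it; it never reaches all of them,
# because the right group must not be empty.
first, rest = levels[0], levels[1:]
candidate_splits = [
    {first, *extra}
    for size in range(len(rest))
    for extra in combinations(rest, size)
]

best_left = max(candidate_splits, key=gini_gain)
best_right = set(levels) - best_left

print(len(levels), "levels,", len(candidate_splits), "candidate binary splits")
print("Best grouping:", sorted(best_left), "vs", sorted(best_right))
print("Gini gain:", round(gini_gain(best_left), 4))
```

On this small sample, the highest-gain grouping places Exec-managerial and Prof-specialty on one side and the remaining three occupations on the other, since those three occupations contain only <=50K rows.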