Many Categories Impurity
Explore how the CART algorithm manages categorical features with multiple levels in classification trees by optimizing Gini impurity calculations. Understand the iterative process used to find the best binary splits for data with many categories, improving tree accuracy and reducing computational effort.
Multivalue attributes
When building decision trees, the CART algorithm uses only two-way (i.e., binary) splits of the data, and CART classification trees choose among candidate splits using the Gini gain calculation. This lesson builds on that knowledge by showing how the CART classification tree algorithm handles a situation that is common in business data: categorical features with more than two values.
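As a quick refresher, the Gini gain of a binary split can be sketched in a few lines of Python. This is a minimal illustration, not the lesson's own code: the function names `gini_impurity` and `gini_gain` and the toy `yes`/`no` labels are assumptions made up for this example.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted


# Hypothetical toy labels: a split that separates the two classes perfectly.
parent = ["yes", "yes", "no", "no"]
left = ["yes", "yes"]
right = ["no", "no"]

gain = gini_gain(parent, left, right)
print(f"Gini gain: {gain:.2f}")
```

The larger the gain, the more the split reduces impurity, so a tree builder would prefer the candidate split with the highest value.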
Consider the following Adult Census Income data sample:
Adult Census Income Data Sample
| Occupation | Income |
| --- | --- |
| Adm-clerical | <=50K |
| Exec-managerial | <=50K |
| Handlers-cleaners | <=50K |
| Handlers-cleaners | <=50K |
| Prof-specialty | <=50K |
| Exec-managerial | <=50K |
| Other-service | <=50K |
| Exec-managerial | >50K |
| Prof-specialty | >50K |
| Exec-managerial | >50K |
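To get a feel for what a binary split on this feature involves, the sample can be scored by brute force. The sketch below is an assumption-laden illustration, not necessarily the procedure CART implementations use: it enumerates every way of dividing the occupation values into two non-empty groups and scores each by size-weighted Gini impurity. The helper names (`gini_from_counts`, `weighted_gini`, `binary_partitions`) are hypothetical.

```python
from collections import Counter
from itertools import combinations

# The ten rows from the table above, as (occupation, income) pairs.
rows = [
    ("Adm-clerical", "<=50K"),
    ("Exec-managerial", "<=50K"),
    ("Handlers-cleaners", "<=50K"),
    ("Handlers-cleaners", "<=50K"),
    ("Prof-specialty", "<=50K"),
    ("Exec-managerial", "<=50K"),
    ("Other-service", "<=50K"),
    ("Exec-managerial", ">50K"),
    ("Prof-specialty", ">50K"),
    ("Exec-managerial", ">50K"),
]


def gini_from_counts(counts):
    """Gini impurity computed from a Counter of class counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def weighted_gini(rows, left_categories):
    """Size-weighted Gini impurity of the two sides of a binary split."""
    left = Counter(inc for occ, inc in rows if occ in left_categories)
    right = Counter(inc for occ, inc in rows if occ not in left_categories)
    n_left, n_right = sum(left.values()), sum(right.values())
    n = n_left + n_right
    return (n_left / n) * gini_from_counts(left) \
        + (n_right / n) * gini_from_counts(right)


def binary_partitions(categories):
    """Yield the left group of every two-way partition exactly once.

    The first category is pinned to the left group so that mirror-image
    partitions are not counted twice; the right group is never empty.
    """
    first, rest = categories[0], categories[1:]
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            yield frozenset((first,) + extra)


categories = sorted({occ for occ, _ in rows})
candidates = list(binary_partitions(categories))
best_left = min(candidates, key=lambda left: weighted_gini(rows, left))

print(f"Candidate splits: {len(candidates)}")
print(f"Best left group: {sorted(best_left)}")
print(f"Weighted Gini: {weighted_gini(rows, best_left):.3f}")
```

A feature with k distinct values has 2^(k-1) - 1 such partitions, so exhaustive enumeration grows quickly as the number of categories increases.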
In this data sample, the occupation feature has five distinct