Many Categories Impurity

Explore how the CART algorithm manages categorical features with multiple levels in classification trees by optimizing Gini impurity calculations. Understand the iterative process used to find the best binary splits for data with many categories, improving tree accuracy and reducing computational effort.

Multivalue attributes

When building decision trees, the CART algorithm uses only two-way (i.e., binary) splits of the data, and CART classification trees choose among candidate splits using the Gini gain calculation. This lesson builds on that by showing how the CART classification tree algorithm handles a situation that is widespread in business data: categorical features with more than two values.
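As a brief refresher, the Gini gain of a binary split can be sketched in a few lines of Python. This is a minimal illustration rather than the code of any particular CART implementation; the function names `gini_impurity` and `gini_gain` and the toy labels are assumptions made for this example.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted


# Toy labels (hypothetical): a perfectly mixed parent split into two pure children.
parent = ["yes", "yes", "no", "no"]
print(gini_impurity(parent))                            # 0.5
print(gini_gain(parent, ["yes", "yes"], ["no", "no"]))  # 0.5
```

A split that separates the classes completely recovers all of the parent's impurity, which is why the gain here equals the parent's Gini impurity.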

Consider the following Adult Census Income data sample:

Adult Census Income Data Sample

| Occupation        | Income |
|-------------------|--------|
| Adm-clerical      | <=50K  |
| Exec-managerial   | <=50K  |
| Handlers-cleaners | <=50K  |
| Handlers-cleaners | <=50K  |
| Prof-specialty    | <=50K  |
| Exec-managerial   | <=50K  |
| Other-service     | <=50K  |
| Exec-managerial   | >50K   |
| Prof-specialty    | >50K   |
| Exec-managerial   | >50K   |
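To make the size of the problem concrete, the sketch below loads the ten rows of the sample and lists every way to divide the occupation values into two non-empty groups, scoring each division by Gini gain. It is a brute-force illustration under stated assumptions, not a description of how CART itself searches for the split; the names `rows`, `gini`, `gain_for`, and `splits` are hypothetical and exist only for this example.

```python
from collections import Counter
from itertools import combinations

# The ten rows of the sample, as (occupation, income) pairs.
rows = [
    ("Adm-clerical", "<=50K"), ("Exec-managerial", "<=50K"),
    ("Handlers-cleaners", "<=50K"), ("Handlers-cleaners", "<=50K"),
    ("Prof-specialty", "<=50K"), ("Exec-managerial", "<=50K"),
    ("Other-service", "<=50K"), ("Exec-managerial", ">50K"),
    ("Prof-specialty", ">50K"), ("Exec-managerial", ">50K"),
]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0


def gain_for(left_levels):
    """Gini gain of sending the given occupations left and all others right."""
    parent = [income for _, income in rows]
    left = [income for occ, income in rows if occ in left_levels]
    right = [income for occ, income in rows if occ not in left_levels]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
    return gini(parent) - weighted


levels = sorted({occ for occ, _ in rows})
print(len(levels))  # 5 distinct occupations

# Pin the first level to the left side so mirrored splits are not counted twice.
first, rest = levels[0], levels[1:]
splits = []
for r in range(len(rest)):  # stops before len(rest), so the right side is never empty
    for combo in combinations(rest, r):
        left = {first, *combo}
        splits.append((left, set(levels) - left))
print(len(splits))  # 15, i.e. 2**(5 - 1) - 1

best_left, best_right = max(splits, key=lambda s: gain_for(s[0]))
print(sorted(best_left), sorted(best_right), round(gain_for(best_left), 4))
```

With k distinct values there are 2^(k-1) - 1 candidate binary splits, so exhaustive enumeration of this kind grows quickly as the number of categories increases.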

In this data sample, the Occupation feature has five distinct values (known as levels in R) ...