Multiple Partitions

Learn how to divide a dataset into multiple partitions based on certain criteria.

Overview

Our goal is to separate testing and training data. There’s a tiny bump in the road, however, called deduplication. The statistical measures of overall quality rely on the training and testing sets being independent; this means we need to avoid duplicate samples being split between testing and training sets. Before we can create testing and training partitions, we need to find any duplicates.

We can’t easily compare each sample with every other sample; for a large set of samples, this would take a very long time. A pool of ten thousand samples would require on the order of 10,000² = 100 million pairwise checks for duplication, which isn’t practical. Instead, we can partition our data into subgroups in which the values for all the measured features are likely to be equal. Then, from those subgroups, we can choose testing and training samples. Confining the duplicate checks to these small subgroups spares us from comparing every sample against every other one.
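As a sketch of this idea, the following uses Python’s built-in hash() to assign each sample to a bucket and then checks for duplicates only within each bucket. It assumes each sample is a hashable tuple of feature values; bucket_by_hash, deduplicate, and n_buckets are illustrative names, not anything defined in this text.

```python
from collections import defaultdict

def bucket_by_hash(samples, n_buckets=128):
    """Group samples into buckets; only same-bucket samples can be equal."""
    buckets = defaultdict(list)
    for sample in samples:
        # Equal samples always share a hash, hence a bucket; unequal
        # samples may still collide, so a within-bucket check follows.
        buckets[hash(sample) % n_buckets].append(sample)
    return buckets

def deduplicate(samples, n_buckets=128):
    """Remove duplicates by comparing samples only within each bucket."""
    unique = []
    for bucket in bucket_by_hash(samples, n_buckets).values():
        seen = []
        for sample in bucket:
            if sample not in seen:  # comparisons stay inside the bucket
                seen.append(sample)
        unique.extend(seen)
    return unique
```

With the ten-thousand-sample pool spread over 128 buckets, each bucket holds roughly 80 samples, so the quadratic comparison cost applies only within those small groups rather than across the whole pool.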

Mathematical interpretation of partitioning

If we use Python’s internal hash values, we can create buckets containing samples that may have equal values. In Python, if two items are equal, they must have the same integer hash value. The converse is not true: items may coincidentally share a hash value without actually being equal.
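A quick demonstration of both directions of this rule; note that the collision example relies on a CPython implementation detail:

```python
# Equal objects must have equal hashes, even across numeric types:
assert 1 == 1.0
assert hash(1) == hash(1.0)

# The converse does not hold. In CPython, hash(-1) is -2 because -1
# is reserved internally as an error code, so -1 and -2 share a hash
# value despite being unequal:
assert hash(-1) == hash(-2)
assert -1 != -2
```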

Formally, we can say this:

a = b ⇒ h(a) = h(b)

That is, if two Python objects, a and b, are equal, then their hash values, h(a) and h(b), must also be equal.
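To connect this back to partitioning, here is a small check with hypothetical samples represented as tuples of feature values; because equal samples necessarily share a hash, they also necessarily land in the same bucket:

```python
# Hypothetical samples: tuples of measured feature values.
a = (5.1, 3.5, 1.4, 0.2)
b = (5.1, 3.5, 1.4, 0.2)

assert a == b                          # equal feature values
assert hash(a) == hash(b)              # a = b  ⇒  h(a) = h(b)
assert hash(a) % 128 == hash(b) % 128  # same bucket under any modulus
```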