Multiple Partitions

Learn how to divide a dataset into multiple partitions based on certain criteria.

Overview

Our goal is to separate testing and training data. There’s a tiny bump in the road, however, called deduplication. The statistical measures of overall quality rely on the training and testing sets being independent; this means we need to avoid duplicate samples being split between testing and training sets. Before we can create testing and training partitions, we need to find any duplicates.

We can’t easily compare each sample with every other sample; for a large set of samples, this would take a very long time. A pool of ten thousand samples would require on the order of 10,000² = 100 million pairwise checks for duplication, which isn’t practical. Instead, we can partition our data into subgroups in which the values for all the measured features are likely to be equal. Then, from those subgroups, we can choose testing and training samples. Confining the duplicate checks to these small subgroups spares us from comparing every sample against every other one.
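As a sketch of this idea, the following uses Python’s built-in hash() to assign each sample to a bucket and then checks for duplicates only within each bucket. It assumes each sample is a hashable tuple of feature values; bucket_by_hash, deduplicate, and n_buckets are illustrative names, not anything defined in this text.

```python
from collections import defaultdict

def bucket_by_hash(samples, n_buckets=128):
    """Group samples into buckets; only same-bucket samples can be equal."""
    buckets = defaultdict(list)
    for sample in samples:
        # Equal samples always share a hash, hence a bucket; unequal
        # samples may still collide, so a within-bucket check follows.
        buckets[hash(sample) % n_buckets].append(sample)
    return buckets

def deduplicate(samples, n_buckets=128):
    """Remove duplicates by comparing samples only within each bucket."""
    unique = []
    for bucket in bucket_by_hash(samples, n_buckets).values():
        seen = []
        for sample in bucket:
            if sample not in seen:  # comparisons stay inside the bucket
                seen.append(sample)
        unique.extend(seen)
    return unique
```

With the ten-thousand-sample pool spread over 128 buckets, each bucket holds roughly 80 samples, so the quadratic comparison cost applies only within those small groups rather than across the whole pool.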

Mathematical interpretation of partitioning

If we use Python’s internal hash values, we can create buckets containing samples that may have equal values. In Python, if two items are equal, they must have the same integer hash value. The converse is not true: items may coincidentally share a hash value without actually being equal.
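A quick demonstration of both directions of this rule; note that the collision example relies on a CPython implementation detail:

```python
# Equal objects must have equal hashes, even across numeric types:
assert 1 == 1.0
assert hash(1) == hash(1.0)

# The converse does not hold. In CPython, hash(-1) is -2 because -1
# is reserved internally as an error code, so -1 and -2 share a hash
# value despite being unequal:
assert hash(-1) == hash(-2)
assert -1 != -2
```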

Formally, we can say this:

a = b ⇒ h(a) = h(b)

That is, if two Python objects, a and b, are equal, then their hash values, h(a) and h(b), must also be equal.
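To connect this back to partitioning, here is a small check with hypothetical samples represented as tuples of feature values; because equal samples necessarily share a hash, they also necessarily land in the same bucket:

```python
# Hypothetical samples: tuples of measured feature values.
a = (5.1, 3.5, 1.4, 0.2)
b = (5.1, 3.5, 1.4, 0.2)

assert a == b                          # equal feature values
assert hash(a) == hash(b)              # a = b  ⇒  h(a) = h(b)
assert hash(a) % 128 == hash(b) % 128  # same bucket under any modulus
```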