Introduction to Data Scrubbing

We will go over how to manipulate data and prepare it for further analysis or modeling.

Why do we need data scrubbing?

Like any Swiss or Japanese watch, a good machine learning model should run smoothly and contain no extra parts. This means avoiding syntax or other errors that prevent the code from executing as well as removing redundant variables that might clog up the model’s decision path.

This bias towards simplicity is just as important for beginners coding their first model. When working with a new algorithm, aim first for a minimum viable model. If you find yourself at an impasse, look at the troublesome element and ask, “Do I need it?” If the model can’t handle missing values or multiple variable types, the quickest cure is to remove those variables. This should help the afflicted model spring to life and breathe normally. Once the model is working, you can go back and add complexity to your code.

Let’s now look at specific data scrubbing techniques to prepare, streamline, and optimize the data for analysis.

What is data scrubbing?

Data scrubbing is an umbrella term for manipulating data in preparation for analysis.

Some algorithms, for example, don’t recognize certain data types, or they return an error message in response to missing values or non-numeric input. Variables may also need to be rescaled to a common range or converted to a more compatible data type.
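To make the conversion and scaling steps concrete, here is a minimal sketch in plain Python. The sample column and the helper names `to_float` and `min_max_scale` are made up for illustration, and min-max scaling is only one of several possible scaling methods; in practice a data library would usually handle this.

```python
def to_float(value):
    """Convert a raw value to float; return None if it is missing or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def min_max_scale(values):
    """Rescale numeric values to the range 0 to 1, leaving None (missing) untouched."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    low, high = min(present), max(present)
    span = high - low
    if span == 0:
        # All present values are identical, so map them to 0.0.
        return [0.0 if v is not None else None for v in values]
    return [(v - low) / span if v is not None else None for v in values]


# Hypothetical raw column: numbers stored as text, one missing entry, one bad entry.
raw_prices = ["100", "250", None, "400", "n/a"]

prices = [to_float(p) for p in raw_prices]
scaled = min_max_scale(prices)

print(prices)  # [100.0, 250.0, None, 400.0, None]
print(scaled)  # [0.0, 0.5, None, 1.0, None]
```

The missing and non-numeric entries are only flagged as `None` here; deciding what to do with them is a separate scrubbing step.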

Linear regression, for instance, is designed for continuous variables. Gradient boosting, on the other hand, requires that both discrete (categorical) and continuous variables be expressed numerically as integers or floating-point numbers.
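One common way to express a categorical variable numerically is one-hot encoding, which replaces the variable with one 0/1 column per category. The sketch below uses plain Python and an invented housing dataset; the function name `one_hot_encode` and the column names are assumptions for illustration only.

```python
def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every variable except the categorical one being encoded.
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical dataset with one continuous and one categorical variable.
houses = [
    {"rooms": 3, "type": "house"},
    {"rooms": 1, "type": "unit"},
    {"rooms": 2, "type": "townhouse"},
]

for row in one_hot_encode(houses, "type"):
    print(row)
# {'rooms': 3, 'type_house': 1, 'type_townhouse': 0, 'type_unit': 0}
# {'rooms': 1, 'type_house': 0, 'type_townhouse': 0, 'type_unit': 1}
# {'rooms': 2, 'type_house': 0, 'type_townhouse': 1, 'type_unit': 0}
```

After encoding, every value in the dataset is an integer, which is the form such algorithms expect.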

Duplicate information, redundant variables, and errors in the data are other problems that often conspire to derail a model’s capacity to dispense valuable insight.
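Duplicate rows are usually the easiest of these problems to fix. A minimal sketch, assuming each row is a dictionary of simple (hashable) values and that only exact duplicates should be removed; the `drop_duplicates` helper and the sample data are hypothetical:

```python
def drop_duplicates(rows):
    """Return rows with exact duplicates removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        # Sort the items so that key order does not affect the comparison.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical customer records; the third row repeats the first.
customers = [
    {"id": 1, "city": "Tokyo"},
    {"id": 2, "city": "Zurich"},
    {"id": 1, "city": "Tokyo"},
]

unique_customers = drop_duplicates(customers)
print(len(customers), "->", len(unique_customers))  # 3 -> 2
```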

Another consideration when working with private data is removing personal identifiers that could contravene relevant data privacy regulations or damage the trust of customers, users, and other stakeholders. This is less of a problem for publicly available datasets, but it is something to be mindful of whenever the data is private.
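In its simplest form, removing personal identifiers means dropping the columns that contain them. The sketch below assumes the identifying columns are named `name`, `email`, and `phone`; these names, the sample record, and the `remove_identifiers` helper are illustrative only. Dropping columns does not by itself guarantee anonymity or regulatory compliance.

```python
# Assumed names of the columns that identify a person (hypothetical).
PERSONAL_IDENTIFIERS = {"name", "email", "phone"}


def remove_identifiers(rows, identifiers=PERSONAL_IDENTIFIERS):
    """Return copies of the rows without the identifying columns."""
    return [
        {key: value for key, value in row.items() if key not in identifiers}
        for row in rows
    ]


# Hypothetical private dataset.
records = [
    {"name": "A. Example", "email": "a@example.com", "age": 34, "spend": 120.5},
    {"name": "B. Example", "email": "b@example.com", "age": 51, "spend": 80.0},
]

scrubbed = remove_identifiers(records)
print(scrubbed)
# [{'age': 34, 'spend': 120.5}, {'age': 51, 'spend': 80.0}]
```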

Data Scrubbing Operations

The following are data scrubbing operations:

  • Removing variables
  • One-hot encoding
  • Dropping missing values
  • Dimension reduction
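As a small preview of one of these operations, the sketch below drops every row that contains a missing value. It assumes missing values are represented as `None`; the `drop_missing` helper and the sample data are hypothetical.

```python
def drop_missing(rows):
    """Keep only the rows in which no value is missing (None)."""
    return [row for row in rows if all(value is not None for value in row.values())]


# Hypothetical dataset; the second row has a missing price.
listings = [
    {"rooms": 3, "price": 650000},
    {"rooms": 2, "price": None},
    {"rooms": 4, "price": 820000},
]

complete = drop_missing(listings)
print(complete)
# [{'rooms': 3, 'price': 650000}, {'rooms': 4, 'price': 820000}]
```

Dropping rows is simple, but it discards the other values in those rows as well, so it suits datasets where only a small share of rows is affected.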

In the coming lessons, we will discuss each operation one by one in detail.
