MD Implementation Steps: 3 to 8

This lesson will continue to introduce the implementation steps (3-8) of model design.

3) Exploratory data analysis

The third step, exploratory data analysis (EDA), provides an opportunity to become familiar with your data, including its distribution and the state of missing values. EDA also drives the next stage of data scrubbing and your choice of algorithm.

In addition, exploratory data analysis may come into play in other sections of your code as you check the size and structure of your dataset and integrate that feedback to direct model optimization.
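As a rough illustration of these checks, the sketch below inspects the size, missing values, and distribution of a small dataset with pandas. The dataset and its column names (`rooms`, `price`, `suburb`) are made up for this example and are not part of the lesson.

```python
import pandas as pd

# Hypothetical toy dataset; the column names are illustrative only.
df = pd.DataFrame({
    "rooms": [2, 3, None, 4, 3],
    "price": [300.0, 450.0, 500.0, None, 420.0],
    "suburb": ["A", "B", "B", None, "A"],
})

n_rows, n_cols = df.shape      # size of the dataset
missing = df.isnull().sum()    # count of missing values per column
summary = df.describe()        # distribution of the numeric columns

print(n_rows, n_cols)
print(missing)
print(summary)
```

The missing-value counts feed directly into the data scrubbing stage, and the summary statistics can hint at which algorithms suit the data.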

4) Data scrubbing

The data scrubbing stage usually consumes the most time and effort in developing a prediction model. Like looking after a good pair of dress shoes, it’s important to pay attention to the quality and composition of your data.

This means cleaning up the data, inspecting its value, making repairs, and, ultimately, knowing when to throw it out.
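A minimal sketch of what this can look like with pandas is shown below. The dataset, the column names, and the specific choices (dropping a mostly empty column, filling with the median, dropping rows with no target value) are assumptions made for illustration; the right repairs depend on your own data.

```python
import pandas as pd

# Hypothetical toy dataset; the column names are illustrative only.
df = pd.DataFrame({
    "rooms": [2.0, 3.0, None, 4.0, 3.0],
    "price": [300.0, 450.0, 500.0, None, 420.0],
    "notes": [None, None, "renovated", None, None],
})

# Throw out: drop a column that is mostly empty.
df = df.drop(columns=["notes"])

# Repair: fill missing room counts with the column median.
df["rooms"] = df["rooms"].fillna(df["rooms"].median())

# Throw out: drop rows where the value we want to predict is missing.
df = df.dropna(subset=["price"])

print(df)
```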

5) Pre-model algorithm (optional)

As an optional extension of the data scrubbing process, unsupervised learning techniques, including k-means clustering and dimensionality reduction algorithms, are sometimes used in preparation for analyzing large and complex datasets.

The k-means clustering technique can reduce row volume by compressing rows into a lower number of clusters based on similar values before conducting further analysis using supervised learning.

This step, though, is optional and does not apply to every model, particularly for small datasets with a low number of dimensions (features) or rows.
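To make the row-compression idea concrete, here is a small sketch using scikit-learn's `KMeans`. The synthetic data (300 rows in three groups) and the choice of three clusters are assumptions for the example; in practice the number of clusters has to be chosen for your dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical data: 300 rows, 2 features, drawn around three centers.
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(100, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(100, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)         # cluster index for each original row
centroids = kmeans.cluster_centers_    # 3 rows that summarize the 300

print(X.shape, "->", centroids.shape)
```

The centroids (or the cluster labels) can then be passed on to a supervised learning step in place of, or alongside, the original rows.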

6) Split validation

Split validation is used to partition the data into training and test sets. It’s also useful to randomize your data at this point using the shuffle feature and to set a random state if you want to replicate the model’s output in the future.
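One common way to do this is scikit-learn's `train_test_split`, sketched below. The toy arrays, the 70/30 split, and the random state value of 10 are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows with 2 features each, plus a target value per row.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# shuffle=True randomizes row order before splitting;
# random_state fixes the shuffle so the split can be reproduced later.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=10
)

print(len(X_train), len(X_test))
```

Running the same call again with the same random state returns the same partition, which is what makes the model's output replicable.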

7) Set algorithm

Algorithms are the headline act for every machine learning model and must be chosen carefully.

The algorithm is a mathematically based sequence of steps that reacts to changing patterns to generate a decision or output. By executing the series of steps defined by the algorithm, the model reacts to input variables in order to interpret patterns, make calculations, and reach decisions.

As input data is variable, algorithms can produce different outputs based on the input data. Algorithms are also malleable in that they have hyperparameters that can be adjusted to create a more customized model.

Algorithms are, thus, a moving framework rather than a concrete equation and are customizable based on the target output and the characteristics of the input data.
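As a hedged illustration of hyperparameters at work, the sketch below trains the same algorithm twice on the same data with a different `max_depth` setting. The decision tree, the tiny dataset, and the depth values are assumptions chosen for the example, not a recommendation from the lesson.

```python
from sklearn.tree import DecisionTreeClassifier

# Tiny hypothetical dataset: two features per row and a binary label.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3]]
y = [0, 0, 0, 1, 1, 1]

# Same algorithm, different hyperparameter values.
shallow = DecisionTreeClassifier(max_depth=1, random_state=0)
deeper = DecisionTreeClassifier(max_depth=3, random_state=0)

# fit() learns from the data and returns the fitted estimator.
shallow_model = shallow.fit(X, y)
deeper_model = deeper.fit(X, y)

print(shallow_model.get_depth(), shallow_model.score(X, y))
print(deeper_model.get_depth(), deeper_model.score(X, y))
```

The two trees differ only in one hyperparameter, yet they end up with different structures and different accuracy on the same input data.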

For context, the algorithm should not be confused with the model. The model is the final state of the algorithm: it exists after hyperparameters are consolidated in response to patterns learned from the data and after a combination of data scrubbing, split validation, and evaluation techniques is completed. Below is a list of popular algorithms used in machine learning and their common characteristics.
