Introduction to Clustering of Sequences

Previously, we looked into how to visualize sequence data and how to find the most frequent sequences. However, these methods do not give us a way to understand the differences between players or ways to group them based on their behaviors. In this section, we’ll look into methods for clustering sequences. This is important because we can profile players by grouping them based on their actions. Furthermore, we can understand more about how players exhibit similar problem-solving strategies if we can group them based on their behaviors.

To cluster sequences, we can use any of the clustering algorithms discussed previously. Recall that a clustering algorithm needs a distance function in order to compare items and output clusters. Prior research has produced several methods for building a good distance metric for sequences. In this section, we'll discuss some of these methods but focus on optimal matching (OM), which we have found to be the most effective.
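To make the role of the distance function concrete, here is a minimal Python sketch of clustering that works purely from pairwise distances. It assumes the distances between sequences have already been computed; the matrix values are made up for illustration, and single-linkage agglomerative clustering is chosen only for brevity, not because it is the algorithm used in this course's lab.

```python
def single_linkage(dist, k):
    """Merge the two closest clusters repeatedly until only k remain.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    Returns a list of clusters, each a list of item indices.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical distances between four players' sequences (illustrative only).
toy_dist = [
    [0, 1, 6, 7],
    [1, 0, 5, 6],
    [6, 5, 0, 2],
    [7, 6, 2, 0],
]
groups = single_linkage(toy_dist, k=2)
print(groups)
```

Note that the clustering step never looks at the sequences themselves, only at the distances. This is why the choice of distance metric matters so much for the resulting player groups.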

Dealing with continuous time-series data

Note that we are assuming that states and actions are discrete, or that any continuous data has been turned into discrete values through binning, as discussed earlier. If we’re dealing with continuous time-series data, we can instead build distance metrics using dynamic time warping (DTW), which is provided by the dtw package in R.

Example

DTW is a mathematical method used to calculate the similarity between continuous streams of data given a set of constraints and costs. Biometric data, such as EEG, is one example of data for which DTW can be used. Since most game data can be turned into discrete values, we will discuss OM instead of DTW in this chapter.
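For intuition only, the core of DTW can be sketched in a few lines of Python. This is a simplified illustration, not the R dtw package: it assumes one-dimensional numeric series, uses the absolute difference as the local cost, and applies no windowing constraints or custom step patterns. The function name `dtw_distance` is hypothetical.

```python
def dtw_distance(x, y):
    """Return the accumulated cost of the cheapest warping path between x and y."""
    inf = float("inf")
    n, m = len(x), len(y)
    # cost[i][j] is the cheapest alignment of the first i points of x
    # with the first j points of y.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(x[i - 1] - y[j - 1])
            # A point may be matched after advancing in x, in y, or in both.
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


# The second series repeats a value, so it is the first one "stretched" in time.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
print(dtw_distance([0, 0], [1, 1]))
```

Because DTW can stretch or compress the time axis, the first pair of series aligns perfectly even though the two have different lengths.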

The Clustering sequences lab of this chapter contains a walkthrough of the functions that can be used to perform OM and to cluster the sequence data using the distance measures that OM produces. We will discuss some of the results from this lab here to illustrate the method and its value.

Methods for measuring similarity

There are two ways to develop a similarity measure between sequences:

• By counting the dissimilar attributes.

• By calculating the amount of effort required to edit one sequence to match the other.
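As a rough preview, the two ideas above can be sketched in Python. Both functions are simplified stand-ins: `mismatch_count` is one simple reading of counting dissimilar attributes (position-by-position mismatches on equal-length sequences), and `edit_distance` is the classic unit-cost edit distance, whereas edit-based methods for sequences generally allow substitution and insertion/deletion costs to be configured. The action labels are hypothetical.

```python
def mismatch_count(a, b):
    """Count positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # substitute x with y, or keep a match
            ))
        prev = curr
    return prev[-1]


# Hypothetical action sequences for three players.
p1 = ["move", "jump", "attack", "move"]
p2 = ["move", "attack", "attack", "move"]
p3 = ["jump", "attack", "move"]

print(mismatch_count(p1, p2))  # same length, compared position by position
print(edit_distance(p1, p3))   # different lengths are fine for edit distance
```

The count-based version needs sequences to be aligned position by position, while the edit-based version can compare sequences of different lengths.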

In the next two subsections, we'll discuss these methods in more detail.

Distance measures based on counts
