Introduction to Clustering of Sequences

Learn the basics of clustering of sequences.

We'll cover the following

Dealing with the continuous time-series
- Example
Methods for similarity measure
Distance measures based on counts
Optimal Matching (OM)
Variances between clusters
What does Cluster 3 indicate?
- Summary of the example

Previously, we looked into how to visualize sequence data and how to find the most frequent sequences. However, these methods do not show us how players differ from one another, nor do they give us a way to group players based on their behaviors. In this section, we’ll look into methods for clustering sequences. Grouping players by their actions lets us profile them, and it helps us understand which players exhibit similar problem-solving strategies.

To cluster sequences, we can use any of the clustering algorithms discussed previously. These algorithms require a distance function in order to perform cluster analysis and output clusters. Previous research has produced several methods that can help us develop a good distance metric for sequences. In this section, we'll discuss some of these methods but focus on optimal matching (OM), which we have found to be the most effective.
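As a rough illustration of why the distance function is the key ingredient, the sketch below builds a pairwise distance matrix and passes it to a small average-linkage agglomerative clustering routine. It is written in Python for brevity. The player sequences, the function names, and the placeholder `mismatch_distance` are all assumptions made for this example and do not come from the lab. Any sequence distance, including one produced by OM, could be passed in instead.

```python
from itertools import combinations


def mismatch_distance(a, b):
    """Placeholder distance (an assumption): differing positions plus length gap."""
    shared = min(len(a), len(b))
    differing = sum(1 for i in range(shared) if a[i] != b[i])
    return differing + abs(len(a) - len(b))


def distance_matrix(sequences, dist):
    """Symmetric matrix of pairwise distances between sequences."""
    n = len(sequences)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = float(dist(sequences[i], sequences[j]))
    return matrix


def agglomerative(matrix, n_clusters):
    """Average-linkage agglomerative clustering on a precomputed matrix."""
    clusters = [[i] for i in range(len(matrix))]

    def linkage(c1, c2):
        return sum(matrix[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the two clusters with the smallest average distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical action sequences for four players.
players = [
    ["move", "move", "attack", "loot"],
    ["move", "move", "attack", "heal"],
    ["craft", "trade", "craft", "trade"],
    ["craft", "trade", "craft", "craft"],
]
matrix = distance_matrix(players, mismatch_distance)
groups = agglomerative(matrix, n_clusters=2)
print(groups)
```

With these four hypothetical players, the two combat-oriented sequences end up in one group and the two crafting-oriented sequences in the other. The clustering routine itself never looks at the sequences, only at the matrix, which is why the choice of distance matters so much.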

Dealing with the continuous time-series

Note that we assume states and actions are discrete, or that any continuous data has been turned into discrete values through binning, as discussed earlier. If we’re dealing with continuous time-series data instead, distance metrics can be developed using dynamic time warping (DTW), which is available in R through the dtw package.

Example

DTW is a mathematical method used to calculate the similarity between continuous streams of data given a set of constraints and costs. Biometric data, such as EEG, is one example of data that DTW suits well. Since most game data can be turned into discrete values, we will discuss OM instead of DTW in this chapter.
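To make the idea behind DTW more concrete, here is a minimal sketch of its basic dynamic-programming recurrence in plain Python. It is not the R package's implementation. It assumes an absolute-difference cost between samples and applies no window or step-pattern constraints, and the two signals are invented for illustration.

```python
def dtw_distance(x, y):
    """Basic DTW with absolute-difference cost and no window constraint."""
    inf = float("inf")
    n, m = len(x), len(y)
    # cost[i][j] holds the cheapest alignment of x[:i] with y[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(x[i - 1] - y[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # x advances, y[j-1] is reused
                cost[i][j - 1],      # y advances, x[i-1] is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Two hypothetical signals with the same shape, shifted by one time step.
slow = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
fast = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]

pointwise = sum(abs(a - b) for a, b in zip(slow, fast))
warped = dtw_distance(slow, fast)
print(pointwise, warped)
```

Comparing the signals position by position reports a difference of 4.0, whereas DTW reports 0.0 because it is allowed to stretch one signal in time to line it up with the other. That tolerance to shifts in timing is what makes DTW attractive for continuous streams.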

The Clustering sequences lab of this chapter walks through the functions that can be used to perform OM and to cluster the sequence data based on the distance measures that OM produces. We will discuss some of the results from this lab here to give us an understanding of the method and its value.

Methods for similarity measure

There are two ways to develop a similarity measure between sequences:

- By counting the dissimilar attributes.
- By calculating the amount of effort required to edit one sequence to match the other.
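The contrast between the two approaches can be sketched in a few lines of Python. The count-based function below simply counts differing positions, and the edit-based function is a standard edit distance with insertions, deletions, and substitutions. The unit costs, the function names, and the two sequences are assumptions chosen for illustration, and real OM implementations typically let us tune these costs.

```python
def count_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("this count-based comparison assumes equal lengths")
    return sum(1 for s, t in zip(a, b) if s != t)


def edit_distance(a, b, indel=1, substitution=1):
    """Cheapest mix of insertions, deletions, and substitutions turning a into b."""
    # prev[j] is the cost of turning the processed prefix of a into b[:j].
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * indel]
        for j in range(1, len(b) + 1):
            swap = 0 if a[i - 1] == b[j - 1] else substitution
            curr.append(min(
                prev[j] + indel,       # delete a[i-1]
                curr[j - 1] + indel,   # insert b[j-1]
                prev[j - 1] + swap,    # keep or substitute
            ))
        prev = curr
    return prev[-1]


# Hypothetical sequences: the second is the first with "A" moved to the end.
p1 = ["A", "B", "C", "D", "E"]
p2 = ["B", "C", "D", "E", "A"]

print(count_distance(p1, p2))  # every position differs
print(edit_distance(p1, p2))   # one deletion plus one insertion
```

Here the count-based measure reports 5, because every position differs, while the edit-based measure reports 2, because deleting the leading "A" and inserting it at the end is enough. The two measures can therefore disagree sharply about how similar two players are when their behavior is the same but shifted.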

In the next two subsections, we'll discuss these methods in more detail.

Distance measures based on counts
