There are several topics important to game data science that haven't been discussed in this course. However, the course would not be complete without at least mentioning these areas of interest and their implications for game data science.

Distributed big data

With game data now reaching millions of data points collected from millions of users (see the OpenDota API for Dota 2 as an example), techniques for dealing with big data have become important. In terms of data storage and analytics architecture, there have been several advancements over the setup we used in this course. In the course, for the sake of simplicity, we chose to use log files or CSV files containing all user sessions and information. However, this is not how the games industry stores its data today. The industry has been innovating in the way data is stored by moving to the cloud and serverless computing, where the hardware on which data is stored is separated from the software by which the data is accessed and analyzed. A great resource discussing the evolution of analytics architecture is Weber (2018).

Beyond storage, modeling is also affected by issues of scale. With data of this size, it is often hard to load the whole dataset into memory, let alone process or visualize it. The data science community has developed many techniques for dealing with this issue, including distributed processing over distributed databases. One approach currently in practice is to distribute the data: each machine holds a share of the data, and the algorithm updates its parameters based on that share. Another approach is to distribute the model itself, allowing each part of the model access to the entire dataset as it updates its own parameters; synchronization happens only when necessary.
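The first, data-distributed approach can be sketched in plain Python. This is a minimal single-machine simulation, not a real cluster: each "worker" sees only its own partition of the data, computes a partial summary, and the summaries are merged into a global result. The playtime values and partitioning are made up for illustration.

```python
def partial_stats(partition):
    # Each simulated worker summarizes only its own share of the data.
    return sum(partition), len(partition)

def distributed_mean(partitions):
    # Merge the per-worker (sum, count) summaries into a global mean,
    # without any worker ever holding the full dataset.
    partials = [partial_stats(p) for p in partitions]
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

# Playtime minutes, split across three hypothetical workers.
partitions = [[30, 45, 12], [60, 22], [18, 90, 41, 7]]
mean_playtime = distributed_mean(partitions)
```

Note that this only works because the mean can be recovered from per-partition sums and counts; as discussed below, not every algorithm decomposes so cleanly.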

Challenges in distributed data

However, distributed processing poses several important challenges that we would like to discuss here. First, not all the techniques discussed in this course can be used in a distributed architecture. Most often, parallelizing such algorithms requires summarizing or aggregating the data along the way. Second, many open questions remain about how one can do exploratory data analysis on such a distributed architecture.


A great example of how this is done is discussed by Weber (2019). In his article, he discusses how his team developed a distributed, serverless platform for analyzing data from millions of users using Python and Spark, a fast, general-purpose cluster-computing system developed by Apache. He describes the development of a pipeline composed of several processes, including data extraction, feature engineering, feature application, model training, and model publishing. While these steps seem similar to what we discussed in the course, dealing with big data involves more abstraction and summarization as data moves from one process to the next. Furthermore, using Python and Spark enabled the team to apply many of the data processing functions that are usually run on a single DataFrame (such as the ones we discussed in this course) to distributed DataFrames. Interested learners are advised to consult the article. It should be noted that R also has libraries to handle distributed computation; please see the parallel and future packages.
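The pipeline stages Weber describes can be sketched on a single machine with plain Python. This is only a conceptual sketch: all function names, event fields, and values below are hypothetical, and in the real platform each stage runs over distributed Spark DataFrames rather than Python lists.

```python
def extract(raw_events):
    # Data extraction: keep only well-formed events.
    return [e for e in raw_events if "user" in e and "playtime" in e]

def engineer_features(events):
    # Feature engineering: aggregate per-user session counts and playtime.
    features = {}
    for e in events:
        f = features.setdefault(e["user"], {"sessions": 0, "total_playtime": 0})
        f["sessions"] += 1
        f["total_playtime"] += e["playtime"]
    return features

def train(features):
    # Stand-in for model training: a single summary statistic.
    total = sum(f["total_playtime"] for f in features.values())
    sessions = sum(f["sessions"] for f in features.values())
    return {"avg_session_playtime": total / sessions}

def publish(model):
    # Stand-in for model publishing: wrap the model as a versioned artifact.
    return {"model": model, "version": 1}

raw = [
    {"user": "a", "playtime": 30},
    {"user": "a", "playtime": 20},
    {"user": "b", "playtime": 10},
    {"malformed": True},  # dropped during extraction
]
artifact = publish(train(engineer_features(extract(raw))))
```

The point of the sketch is the shape of the pipeline: each stage consumes the previous stage's output and passes along a smaller, more abstract summary, which is exactly what makes the design amenable to distribution.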

Spatio-temporal analysis

Another important area that we did not discuss in the course is the concept of spatio-temporal analysis. While the discussion provided in the data analysis through visualization chapter shows how a visualization system can be used for spatio-temporal analysis of game data, there’s more that has been done in this area.

Importance of spatio-temporal analysis

Spatio-temporal analytics is vital in game data science for the simple reason that playing games involves action in both of these dimensions, that is, time and space. How do players experience our games within these dimensions? Game user research, specifically, has to work with this understanding, and some of the most famous examples of game data science work were developed through user testing, because user researchers needed to harness the utility of telemetry data (e.g., Kim et al., 2008). Therefore, it is very important to consider these dimensions, and we encourage learners to investigate the references provided in this section.

To begin with, many games involve spatial operations, whether simple point-and-click mechanics, navigation in 2D environments like side-scroller games, or fully-fledged 3D avatar-based movement. The spatial component often forms one of the basic elements of the playing experience. Therefore, the analysis and evaluation of spatial behaviors are important. A good example is Ubisoft's analysis of players' paths through the Assassin's Creed series; see Dankoff's work (2011).
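A common first step in this kind of spatial analysis is binning player positions (or deaths, or other events) into a grid to build a heatmap. The minimal sketch below assumes 2D coordinates; the positions and cell size are made-up illustrative values, not data from any real game.

```python
from collections import Counter

def heatmap(positions, cell_size):
    # Map each (x, y) coordinate to a grid cell and count events per cell.
    counts = Counter()
    for x, y in positions:
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell] += 1
    return counts

# Hypothetical player-death coordinates in world units.
positions = [(1.0, 2.5), (1.4, 2.1), (9.7, 0.3), (1.2, 2.9)]
grid = heatmap(positions, cell_size=5.0)
```

Cells with high counts flag candidate hotspots (e.g., a difficulty spike or a choke point) that a visualization layer can then render over the level map.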

Similarly, the component of time is important. All gameplay occurs over time, and it is therefore common to consider this dimension. Many of the basic metrics discussed in the introductory chapter, like daily active users and user lifetime revenue, are computed using temporal measures. The importance of time is also evident in churn prediction, progression analysis, time-spent analysis, and similar problems that are everyday bread and butter in game data science.
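As a small illustration of such a temporal metric, daily active users (DAU) can be computed from raw session events by counting the distinct users seen on each calendar day. The timestamps and user ids below are invented for the example.

```python
from collections import defaultdict
from datetime import datetime

def daily_active_users(events):
    # events: (iso_timestamp, user_id) pairs.
    # Returns {date: number of distinct users active that day}.
    users_by_day = defaultdict(set)
    for ts, user in events:
        day = datetime.fromisoformat(ts).date()
        users_by_day[day].add(user)
    return {day: len(users) for day, users in users_by_day.items()}

events = [
    ("2023-03-01T09:15:00", "a"),
    ("2023-03-01T21:40:00", "a"),  # same user, same day: counted once
    ("2023-03-01T10:05:00", "b"),
    ("2023-03-02T08:30:00", "b"),
]
dau = daily_active_users(events)
```

The deduplication via a set per day is the essential step; the same grouping pattern extends to weekly or monthly active users by changing the bucketing key.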
