Data intensity: the new norm

Over the last two decades, we have seen “the rise of data”.

Modern systems are so data-centric that we take this for granted. In reality, our use of data has changed drastically over time.

Think of phone applications, websites, software: anything you use on a day-to-day basis. These systems are reading and writing data all the time.

This might be very obvious to you: a system has to handle data, either actively or passively. “What’s so rising about it?” you might ask. What has changed over the past few decades is the intensity of data in all kinds of systems.

In short, modern systems are data-driven. These systems collect data, store it, process it, and generate useful insights from it to drive business and growth.

Let’s discuss these steps—collection, storage, processing, and insight generation.

Data collection

Modern systems collect data from their users. Not all data is the same in type and value. There are two categories we’d like to mention.

Personal data

Personal data is information that is specific to an individual user. For example, on Facebook, whatever you have on your profile is generally personal data: your name, date of birth, credit card credentials, places you have been, the food you like, movies you have watched, friends you have made, products you have purchased, and so on.

Personal data is sensitive. If the system is dealing with this type of data, there are many rules, regulations, and compliance requirements that have to be followed (for example, the GDPR in the EU and the CCPA in California).

User-interaction data

Modern systems also collect user-interaction data as events. Things you do on a website are generally logged as events and later processed to generate insights. As an example, think of an e-commerce website.

Figure: Simple flow of data in a system

As a user, when you search, view, click, or order on the e-commerce website, an event is fired to the backend system. The backend receives the events and stores them in some persistent data storage for further processing.

Note that events do not only mean user interaction. Things happening in your system may also be logged as events and propagated to other parts for storage and processing. For example, if your system blocks a user for some fraudulent activities, that may also be logged as an event.
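
To make this concrete, here is a minimal sketch of what such an event and its storage might look like, assuming a simple append-only log; the field names and the `store_event` helper are illustrative, not a prescribed schema.

```python
import json
import time
import uuid

def make_event(event_type, user_id, payload):
    """Build a generic event envelope (the exact fields are illustrative)."""
    return {
        "event_id": str(uuid.uuid4()),  # unique ID, useful for deduplication
        "type": event_type,             # e.g., "search", "click", "order", "user_blocked"
        "user_id": user_id,
        "timestamp": time.time(),
        "payload": payload,             # event-specific details
    }

def store_event(event, path="events.log"):
    """Append the event to a durable, append-only log (a stand-in for real storage)."""
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")

# A user interaction and a system-generated event flow through the same pipeline.
store_event(make_event("search", "user-42", {"query": "running shoes"}))
store_event(make_event("user_blocked", "user-99", {"reason": "fraudulent activity"}))
```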

Data storage

Collected data has to be stored somewhere for it to be usable. This is where system owners decide what type of storage to use and which database to choose from among the numerous options available.

As expected, there is no hard and fast rule for the best choice of database. It really depends on what the data looks like, the volume of data, and what you want to use the data for. To give some examples:

  • If the data you have is transactional (for example, money transfers between users), it’s likely you will choose a SQL database like MySQL or PostgreSQL (see the sketch after this list).
  • If your data requires very fast retrieval to support queries from users, key-value stores are a feasible option.
  • If the volume is high and slower retrieval is acceptable, then block storage is a viable option. In block storage, data is stored in files and broken up into multiple parts if required.
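
To illustrate the transactional case from the first bullet, here is a minimal sketch using SQLite as a stand-in for MySQL or PostgreSQL; the `accounts` table and the amounts are hypothetical. The point is atomicity: both updates commit together or not at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (user_id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])

try:
    with conn:  # opens a transaction; commits on success, rolls back on any error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE user_id = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE user_id = 'bob'")
except sqlite3.Error:
    print("Transfer failed; no partial update was applied.")

print(conn.execute("SELECT * FROM accounts ORDER BY user_id").fetchall())
# [('alice', 70), ('bob', 80)]
```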

So, in summary, the collected data is stored on storage nodes, and for that, you need to choose a proper data storage technology. The choice has to be based on your system requirements and business use cases.

Data processing

The data your system receives may not necessarily be usable in the exact form it is received in. In the backend servers, you will need to process the data: filtering it, extending or updating schemas, adding or removing attributes, and so on. Whatever comes from the clients will contain the information you need, but you have to make it usable.
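
As a minimal sketch of such a processing step, assume events arrive with the (illustrative) fields below: we filter out malformed events, drop an attribute we don’t need downstream, and add a derived one.

```python
from datetime import datetime, timezone

def process(raw_events):
    """Clean raw client events into a usable shape (field names are illustrative)."""
    processed = []
    for event in raw_events:
        # Filter: drop malformed events that lack required fields.
        if "user_id" not in event or "timestamp" not in event:
            continue
        # Remove an attribute we don't want to keep downstream.
        event.pop("client_debug_info", None)
        # Add a derived attribute to simplify later queries.
        event["date"] = datetime.fromtimestamp(
            event["timestamp"], tz=timezone.utc
        ).date().isoformat()
        processed.append(event)
    return processed

raw = [
    {"user_id": "u1", "timestamp": 1700000000, "type": "click", "client_debug_info": "..."},
    {"type": "click"},  # malformed: no user_id or timestamp, so it is filtered out
]
print(process(raw))
```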

This might sound trivial to you. It’s just reading data as input, running some code on it, and then outputting what is required, right?

In computer science, every problem is trivial as long as the input size is small. Remember, you always have the O(2^n) algorithm at your disposal.

- Author

In modern times, many systems need to deal with a large user base, which means a huge amount of data. Processing this huge amount of data is not a trivial task. Over the years, many different technologies have emerged that simplify the processing of massive amounts of data.

We will discuss this more in a later part of the course.

Insight generation

The last common step in distributed systems is gathering insights from data. Whatever data you collect and process is fed into some part of the system that is responsible for generating useful information and insights to drive business and growth.

For instance, some businesses may want to predict user behavior to choose a proper marketing strategy. To achieve this, the user-interaction data you collected can be fed into a machine learning model to predict user behavior.
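
As a rough sketch of that idea (the features, labels, and the choice of scikit-learn are all assumptions for illustration), a toy model could be trained on per-user interaction counts to predict whether a user will purchase:

```python
from sklearn.linear_model import LogisticRegression

# Toy training data: [searches, product_views, cart_adds] per user (made up).
X_train = [[1, 2, 0], [5, 8, 1], [0, 1, 0], [7, 12, 3]]
y_train = [0, 1, 0, 1]  # 1 = the user ended up purchasing

model = LogisticRegression()
model.fit(X_train, y_train)

# Estimate the purchase probability for a new user's recent interactions.
print(model.predict_proba([[4, 6, 1]])[0][1])
```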

Another example would be generating aggregations and deriving revenue from the data you have. If your business is subscription-based and users have a certain quota every month, you will need to generate insights from user-interaction data to make sure users stay within their usage limit.
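
A minimal sketch of that aggregation, assuming each event carries a `user_id` and a `units` field (both illustrative): sum each user’s monthly usage and flag anyone who goes over the quota.

```python
from collections import defaultdict

MONTHLY_QUOTA = 100  # hypothetical per-user limit

def usage_by_user(events):
    """Aggregate consumed units per user from interaction events."""
    totals = defaultdict(int)
    for event in events:
        totals[event["user_id"]] += event.get("units", 1)
    return totals

events = [
    {"user_id": "u1", "units": 60},
    {"user_id": "u1", "units": 50},
    {"user_id": "u2", "units": 20},
]

for user, used in usage_by_user(events).items():
    if used > MONTHLY_QUOTA:
        print(f"{user} exceeded the quota ({used}/{MONTHLY_QUOTA}); cap further usage")
```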

Key takeaways

  • Systems in modern times have become very data-intensive.
  • For a significant number of companies, the business itself is driven by data.
  • It’s crucial that system owners are well versed in the core concepts of data handling and processing at a large scale.
  • Carefully collecting, storing, processing, and using data is critical for a business to succeed.