
What is Big Data, And Why is it Popular?

Learn about big data and its V-indices, and why it is attracting growing interest.

We'll cover the following...

Overview
How big is “big data”?
Measuring with V-indices
Importance of big data
The big data trend
The growth of big data
How is the data generated?

Overview

Big data refers to data that typically cannot be handled and analyzed using traditional techniques. If you need to make a quick decision based on a big dataset, a computation time of several days might be a business killer.

As the name implies, big data is a dataset too large to be processed effectively by traditional techniques. One such traditional technique is running SQL queries against a relational database that contains the dataset.

Although the idea of big data has existed for a long time, technological advancements now make it possible to deal with it effectively in terms of storage, processing, and speed.

How big is “big data”?

Most people associate big data with a certain threshold of terabytes, petabytes, or exabytes. While this might make sense, there is no official definition of big data in terms of size.

As a simple rule, though, big data refers to datasets that cannot be managed in the traditional way, on a single machine that handles both computing and storage. You can think of big data as any quantity of data greater than 5TB.

Again, this is very simplistic but will act as our working definition for this course.

Measuring with V-indices

Big data can be described along four dimensions, known as the V-indices: volume, variety, velocity, and veracity.

Volume: The main characteristic that defines big data is its sheer volume. Many people ask how many terabytes, petabytes, or exabytes of data we need before it is considered big data. However, it does not make sense to focus on a minimum size, as the total amount of available data grows exponentially every year.
Velocity: How fast is the data generated? Is the dataset mostly static, like “all the geographical points on earth that received heavy snowfall since last week”, or is it updated multiple times every second, like “what people are currently watching on Netflix, YouTube, and Amazon Prime”?
Variety: How diverse is the dataset? Do we keep just text logs, just images, or just sound files? Or does the set include a combination of multiple types of data?
Veracity: What is the quality of the data? Does it contain noise and inconsistencies? For example, do we describe the location of a user in a standard format? Or do some users describe it with a free-text string, whereas others use latitude/longitude coordinates? How many users have left the date of birth empty?
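The veracity questions above can be turned into simple measurements. The following is a minimal sketch in Python, not a definitive data-quality check. The user records, the field names (location, date_of_birth), and the choice of "latitude,longitude" as the standard location format are all assumptions made for illustration.

```python
import re

# Hypothetical user records, invented for illustration only.
users = [
    {"id": 1, "location": "52.5200,13.4050", "date_of_birth": "1990-04-12"},
    {"id": 2, "location": "Berlin, Germany", "date_of_birth": ""},
    {"id": 3, "location": "40.7128,-74.0060", "date_of_birth": None},
    {"id": 4, "location": "somewhere near Paris", "date_of_birth": "1985-11-02"},
]

# Assumed "standard" location format: "latitude,longitude" as decimal numbers.
LAT_LON = re.compile(r"^-?\d+(\.\d+)?,\s*-?\d+(\.\d+)?$")


def veracity_report(records):
    """Return the share of records with a missing date of birth and the
    share whose location is not in the assumed latitude/longitude format."""
    total = len(records)
    missing_dob = sum(1 for r in records if not r.get("date_of_birth"))
    non_standard_location = sum(
        1 for r in records if not LAT_LON.match(r.get("location") or "")
    )
    return {
        "missing_date_of_birth": missing_dob / total,
        "non_standard_location": non_standard_location / total,
    }


report = veracity_report(users)
print(report)
```

In this toy dataset, half of the records have no date of birth and half describe the location as free text. On a real dataset, shares like these give a first indication of how much cleaning is needed before analysis.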

Note: We are not concerned with the “bigness” of our data at the moment. What is important is to choose the best process to make the most out of our dataset.

Importance of big data

Nowadays, knowledge is power, and a big data dataset can contain a lot of knowledge if analyzed correctly.

This can prove crucial for a modern company’s success.

The big data trend

The term big data has gained popularity in the past few years, mainly for the following reasons:

Ubiquitous computing: Even a smart refrigerator can reveal information about us, such as our habits or health issues. For instance, if 98% of the foods in your refrigerator have a low glycemic index, the refrigerator might infer that you have diabetes.
Smartphones: Did you know that your smartphone contains quite a few sensors that are used by the underlying operating system? Every time your smartphone moves around (for instance, when you commute, talk, or walk), the operating system records these movements, which can be analyzed later.

Add the various applications that track our usage, multiply by a few billion smartphones worldwide, and it becomes clear that our devices generate data about us multiple times per second.

Free services: Companies that offer high-quality products free of charge usually want to utilize our data and make money from it, typically through targeted advertising.

This is how Facebook and Google make much of their money: they act as advertising companies.

Similarly, Amazon uses our activity trail to offer us relevant products and increase the possibility of us buying the recommended product.

Don’t take this the wrong way; we’re not saying this is a good or bad thing. It is up to each user to decide what data they are comfortable sharing.

Looking at how Facebook, Google, and Amazon use big data can help us understand why the amount and value of data have become bigger and bigger in the past few years. You can think of value as another V-index.

The growth of big data

The total amount of data is growing exponentially, with a large jump in size every 2-3 years. At the time of writing, roughly 90% of the world’s data had been produced in the past few years.
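To see how exponential growth leads to most data being recent, here is a small worked calculation in Python. It is only a sketch under a stated assumption: that the total amount of data doubles every fixed period (2 or 3 years here). The doubling model and the function names are illustrative and are not taken from measured figures.

```python
import math


def share_created_recently(years_back, doubling_period):
    """Share of all data that was created in the last `years_back` years,
    assuming the total doubles every `doubling_period` years."""
    return 1 - 2 ** (-years_back / doubling_period)


def years_for_share(share, doubling_period):
    """Number of most recent years that account for `share` of all data,
    under the same doubling assumption."""
    return -doubling_period * math.log2(1 - share)


for period in (2, 3):
    years = years_for_share(0.90, period)
    print(f"doubling every {period} years: 90% of data is under {years:.1f} years old")
```

Under this assumption, 90% of all data would be younger than roughly 6.6 years when the total doubles every 2 years, and roughly 10 years when it doubles every 3 years. The exact numbers depend entirely on the assumed doubling period.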

Thanks to faster networks with more bandwidth, transmitting vast amounts of data is now far easier than it used to be. The falling cost of disk storage and memory, along with the increased capacity of these storage media, has provided new opportunities.

The rate at which the data is growing over the years can be observed in the graph below:

How is the data generated?

Usually, our devices generate the data automatically. If you listen to a song on a music service, this event will be recorded by the client (browser or app) and at some point will be sent to a central store called a data lake. We will explain this in more detail later. This step is also referred to as ingestion and persistence.

The service can create recommendations for you from a series of such events, for example which songs you liked. This happens after analyzing your data and comparing it to other users’ data.

The path from data generation to a useful recommendation is not instantaneous. It can take from a few seconds to a couple of hours. However, the faster we do it, the better it is for business.
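The flow described above, from a recorded event to a recommendation, can be sketched in a few lines of Python. This is a toy illustration under clear assumptions: the data lake is just an in-memory list, the event fields (user, action, song) are invented, and the recommendation rule (suggest songs liked by users who share a liked song with you) is a deliberately simple stand-in for what a real service would do.

```python
from collections import Counter, defaultdict

# In-memory stand-in for a data lake: an append-only list of raw events.
data_lake = []


def ingest(event):
    """Ingestion and persistence: store the raw event unchanged."""
    data_lake.append(event)


def recommend(user, top_n=2):
    """Suggest songs liked by users who share at least one liked song with `user`."""
    # Analysis step: group liked songs by user.
    likes = defaultdict(set)
    for event in data_lake:
        if event["action"] == "like":
            likes[event["user"]].add(event["song"])

    mine = likes[user]
    scores = Counter()
    for other, songs in likes.items():
        # Compare with other users: count songs they like that `user` has not liked yet.
        if other != user and mine & songs:
            scores.update(songs - mine)
    return [song for song, _ in scores.most_common(top_n)]


# Hypothetical events, as a client (browser or app) might send them.
raw_events = [
    ("alice", "song_a"), ("alice", "song_b"),
    ("bob", "song_a"), ("bob", "song_c"),
    ("carol", "song_b"), ("carol", "song_c"), ("carol", "song_d"),
    ("dave", "song_e"),
]
for user, song in raw_events:
    ingest({"user": user, "action": "like", "song": song})

print(recommend("alice"))
```

In a real system, ingestion, storage, and analysis run as separate distributed components, which is why the end-to-end delay ranges from seconds to hours.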
