Introduction to Big Data

This lesson introduces Big Data and the Cloud, and covers the why, what, how, and when of Big Data.

Why is data important?

Individuals and businesses use data daily for everyday tasks such as managing expenditure, generating mobile bills, and deciding which movie to watch based on reviews. It is data and its analysis that help us carry out our day-to-day activities and make better decisions.

Sources of Data

This flood of data comes from many sources:

• The New York Stock Exchange generates about 4-5 terabytes of data every day.

• Facebook hosts more than 240 billion photos, growing by 7 petabytes of data every day.

• Ancestry.com, the genealogy site, stores around 10 petabytes of data.

• The Internet Archive stores around 18.5 petabytes of data.

• The Large Hadron Collider near Geneva produces about 30 petabytes of data every year.

Common Problems with Data

• Unimaginable size of data

• Heterogeneous source systems

• Traditional processing systems do not scale up

• RDBMS becomes costly as data size grows

What is Big Data?

According to Gartner, Big Data is a high-volume, high-velocity, and high-variety information asset that demands cost-effective, innovative forms of information processing for enhanced insight and decision making.

Data Multiples

Kilobyte (KB) → Megabyte (MB) → Gigabyte (GB) → Terabyte (TB) → Petabyte (PB) → Exabyte (EB) → Zettabyte (ZB) → Yottabyte (YB), each unit roughly 1,000 times the previous one.

Evolution of Big Data

A major Big Data technology is Hadoop. It has seen explosive growth and mission-critical adoption, and it handles serious volumes of data.

5 V’s of Big Data

Volume - Size can run up to terabytes and beyond

Velocity - Speed at which data is generated and consumed (Batch, Real time)

Veracity - Trustworthiness of data in terms of accuracy

Variety - Different types of data (Structured, e.g. CSV files; Unstructured, e.g. audio, video, text; Semi-structured, e.g. logs)

Value - Use cases built on top of the sourced Big Data
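The Variety dimension above can be made concrete with a short sketch. The field names (`user_id`, `amount`, `event`, `device`) are hypothetical; the point is only the contrast between a fixed CSV schema and a self-describing log line.

```python
import csv
import io
import json

# Structured: CSV rows follow a fixed schema (hypothetical columns).
structured = "user_id,amount\n101,250.0\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: a JSON log line is self-describing; keys may vary per line.
log_line = '{"event": "login", "user_id": 101, "device": "mobile"}'
record = json.loads(log_line)

print(rows[0]["amount"])  # accessed by a fixed column name -> 250.0
print(record["event"])    # accessed by key -> login
```

Unstructured data (audio, video, free text) has no such field structure at all, which is what makes it the hardest of the three to process.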

Solutions

Scale Up - Increase the configuration of a single system, such as disk capacity, RAM, or data transfer speed, but this is a complex, costly, and time-consuming process.

Scale Out - Use multiple commodity (economical) machines and distribute the storage/processing load among them. This is economical and quick to implement because it focuses on distributing the load, but the challenges are coordination between networked machines and handling failures of cheap machines. Big Data technology has to process and analyze data across different machines and then merge the results.
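The scale-out pattern above (split the data, process shards independently, merge the partial results) can be sketched on a single machine using worker processes to stand in for commodity machines. The shard contents and function names here are illustrative, not from any particular framework.

```python
from collections import Counter
from multiprocessing import Pool

# Each worker plays the role of one commodity machine:
# it counts words only in its own shard of the data.
def count_words(shard):
    return Counter(shard.split())

# Merging partial results mirrors the final "merge the data" step.
def merge(partials):
    total = Counter()
    for p in partials:
        total.update(p)
    return total

if __name__ == "__main__":
    shards = ["big data big", "data scale out", "big scale"]
    with Pool(processes=3) as pool:      # distribute the load
        partials = pool.map(count_words, shards)
    print(merge(partials))               # combined word counts
```

The coordination and failure-handling challenges mentioned above are exactly what frameworks like Hadoop take off the programmer's hands in a real cluster.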

Multi-Threaded Programming

Difficult because:

• We don't know the order in which threads run

• We don't know when threads interrupt each other

Thus, we need:

• Semaphores (Lock, Unlock)

• Condition Variables (Wait, Notify, Broadcast)

• Barriers

Still, there are lots of problems:

• Deadlock, livelock

• Race conditions

• …

Moral of the story: be careful!

What are the industry use cases where we have Big Data?

Social media: According to IBM, Big Data technology has helped turn the 12 terabytes of tweets created daily into improved product sentiment analysis.

Finance: Big Data technology has scrutinized 5 million trade events created daily to identify potential fraud. It has helped analyze 500 million daily call detail records in real time to predict "customer churn" faster.

Government: Big Data technology has helped monitor hundreds of live video feeds from surveillance cameras to target points of interest for security agencies.

Banking: Banks with separate data repositories across departments and countries can consolidate them into a single repository, creating a unified data lake, and then analyze and use the data globally.

Health: Fitness wearable devices generate a lot of data on the fly using IoT protocols. This real-time data can be used to derive value by predicting diseases, alerting doctors during heart attacks, and showing performance dashboards for a particular diet and lifestyle recommendation.

Customer Churn: Analyze large datasets of customer data to understand why customers are leaving and which factors (e.g. pricing, network coverage, device issues) contribute.
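A toy sketch of the churn analysis described above: tally the stated reasons among churned customers and compute the churn rate. The records and reason labels are made up for illustration; a real analysis would run over far larger datasets.

```python
from collections import Counter

# Hypothetical records: (customer_id, churned?, stated reason)
records = [
    (1, True,  "pricing"),
    (2, False, None),
    (3, True,  "coverage"),
    (4, True,  "pricing"),
    (5, False, None),
]

# Which factors come up most often among customers who left?
reasons = Counter(reason for _, churned, reason in records if churned)

# Overall fraction of customers who churned.
churn_rate = sum(1 for _, churned, _ in records if churned) / len(records)

print(reasons.most_common())  # factors ranked by frequency
print(churn_rate)
```

At Big Data scale the same tally-and-merge computation is what a scale-out framework distributes across machines.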