Spark and Big Data
Learn about the fundamentals of big data and how Apache Spark fits into processing large datasets. Discover the big data life cycle, batch processing techniques, and how Spark handles data ingestion, transformation, and distributed processing for scalable computation.
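Although Spark's APIs are introduced later in the course, a minimal sketch can make the ingestion, transformation, and distributed processing mentioned above concrete. This is an illustrative PySpark example only; the file name (ratings.csv) and its columns (movie_id, rating) are hypothetical placeholders, not part of the course material.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Entry point to Spark; in a cluster deployment this coordinates
# executors spread across many machines.
spark = SparkSession.builder.appName("BigDataPrimer").getOrCreate()

# Ingestion: Spark reads the file into partitions that can be
# processed in parallel across the cluster.
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)

# Transformation: declared lazily; nothing executes yet.
high = ratings.filter(F.col("rating") >= 4.0)
avg_by_movie = high.groupBy("movie_id").agg(F.avg("rating").alias("avg_rating"))

# Action: triggers distributed execution of the whole plan.
avg_by_movie.show(5)

spark.stop()
```

The key design point this illustrates is that Spark separates the *description* of a computation (the lazy transformations) from its *execution* (the action), which is what lets it schedule work across a cluster when a dataset is too large for one machine.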
Big data primer
Before we describe the processing model that Spark fits into, both in the context of this course and of big data in general, it's important to explain what big data means.
The term big data fundamentally refers to a family of technologies, each aligned with a different strategy for processing large datasets.
The word “large” has traditionally carried the implicit notion that the dataset being processed holds more information than a single resource, such as a lone server or computer, can realistically handle. Because available processing power and business needs change constantly, the word also implies that the size of such a dataset is never pinned to a specific figure.
As vague as it might seem, “big” is an appropriate word for datasets that are not bounded by any fixed size limit while still representing vast volumes ...