Requirements of Spark
Explore the functional and non-functional requirements of Spark, including data processing capabilities, latency optimization, fault tolerance, and memory management. Learn how Spark handles large datasets with efficient partitioning and worker estimations, enabling robust and scalable iterative computations. Understand how these requirements translate into designing a system capable of high throughput and fault resilience in distributed data processing.
Let's understand the functional and non-functional requirements of Spark.
Functional requirements
The functional requirements of Spark are listed below:
Data processing: The system needs to process a large working dataset efficiently and also be able to do it repeatedly for iterative or interactive queries.
Latency and throughput: Our system should achieve low latency and high throughput for tasks such as iterative data processing (where the same data is reused across steps) and ad hoc queries over the same dataset. For example, we expect the system to query many terabytes of data in a few seconds. Typically, the first run is slower than subsequent runs because the system must load the data from disk, which involves I/O; later runs can serve the working set from memory, as sketched below.
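To make these requirements concrete, here is a minimal sketch of reusing a cached dataset for repeated queries. The local session, the /data/events path, and the level column are hypothetical placeholders; the point is that the first action pays the disk I/O cost, while later ad hoc queries are served from the in-memory copy.

```scala
import org.apache.spark.sql.SparkSession

object IterativeQuerySketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical local session; a real deployment would target a cluster.
    val spark = SparkSession.builder()
      .appName("iterative-query-sketch")
      .master("local[*]")
      .getOrCreate()

    // Load a large dataset once; hypothetical input path.
    val events = spark.read.parquet("/data/events")

    // Keep the working set in memory so repeated queries skip the disk.
    events.cache()

    // First query is slower: it scans the files and populates the cache.
    val total = events.count()

    // Subsequent ad hoc queries on the same data run against memory.
    val errors   = events.filter(events("level") === "ERROR").count()
    val warnings = events.filter(events("level") === "WARN").count()

    println(s"total=$total errors=$errors warnings=$warnings")
    spark.stop()
  }
}
```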
Non-functional requirements
Following are the non-functional requirements of Spark.
Fault tolerance: If a data partition is lost, it should be recovered effectively. (For parallel processing, data is divided into chunks called data partitions.)
Data locality: The system should perform computations on the worker where the data resides, minimizing data movement over the network, as the sketch below illustrates.
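As a rough illustration of these two requirements, the sketch below checkpoints a derived dataset to a fault-tolerant store so that a lost partition can be re-read instead of recomputed from scratch, and reads its input from a distributed file system so the scheduler can place tasks on the workers that already hold each block. The checkpoint directory and the HDFS path are hypothetical, and this is only a usage-level sketch, not Spark's internal recovery mechanism.

```scala
import org.apache.spark.sql.SparkSession

object FaultToleranceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fault-tolerance-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical directory on a fault-tolerant store (e.g., HDFS).
    sc.setCheckpointDir("/checkpoints")

    // Reading from a distributed file system lets the scheduler place tasks
    // on the workers that already hold each block (data locality).
    val lines = sc.textFile("hdfs:///data/logs") // hypothetical path

    // A derived dataset produced by a chain of transformations.
    val cleaned = lines.filter(_.nonEmpty).map(_.toLowerCase)

    // Checkpointing materializes the data so a lost partition can be
    // re-read rather than recomputed from the start of the chain.
    cleaned.checkpoint()
    println(cleaned.count()) // action that triggers the checkpoint

    spark.stop()
  }
}
```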