Popularity of Spark

Learn about the instances in which Spark is preferred over more traditional tools.

Spark is a stack of tools that covers most of what a data professional needs: batch processing, SQL queries, streaming, machine learning, and graph analytics. Let’s look at a couple of domains where we might choose Spark over another data tool.

MLlib vs. TensorFlow/PyTorch

There are numerous popular, well-documented machine learning frameworks, like TensorFlow and PyTorch. So why use MLlib?

A big reason to use any tool from the Spark ecosystem is its distributed nature. In addition to computing in memory, Spark can spread that computation across a cluster and read its input directly from a distributed file system such as HDFS.

This helps with scaling the process, and you don’t have to learn a second technology that might not be compatible with Spark.

Remember the tennis ball example from an earlier lesson? We could write a model that predicts which color the next ball will be. Since we already have the data in HDFS, we could use Spark’s integration with HDFS and train our machine learning model right where the data lives.
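As a rough illustration of that idea, here is a minimal PySpark sketch. The HDFS path, the column names prev_color and next_color, and the choice of logistic regression are all assumptions made for this example; the earlier lesson does not define a schema or a model, so treat this as one possible shape rather than the course's implementation.

```python
# Hypothetical location and schema: a CSV in HDFS with two string columns,
# prev_color (the previous ball) and next_color (the ball that followed it).
HDFS_PATH = "hdfs:///data/tennis_balls.csv"


def build_pipeline():
    """Assemble an MLlib pipeline that predicts next_color from prev_color."""
    # Imported inside the function so this file can be loaded without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import OneHotEncoder, StringIndexer

    # Map color names to numeric indices, since MLlib estimators need numeric input.
    prev_indexer = StringIndexer(inputCol="prev_color", outputCol="prev_index")
    label_indexer = StringIndexer(inputCol="next_color", outputCol="label")
    # One-hot encode the previous color into the feature vector.
    encoder = OneHotEncoder(inputCols=["prev_index"], outputCols=["features"])
    classifier = LogisticRegression(featuresCol="features", labelCol="label")
    return Pipeline(stages=[prev_indexer, label_indexer, encoder, classifier])


def train_next_color_model(spark, path=HDFS_PATH):
    """Read the ball data straight from HDFS and fit the pipeline on the cluster."""
    balls = spark.read.csv(path, header=True)
    return build_pipeline().fit(balls)


def main():
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("next-ball-color").getOrCreate()
    model = train_next_color_model(spark)
    # The fitted pipeline adds a "prediction" column holding the predicted label index.
    balls = spark.read.csv(HDFS_PATH, header=True)
    model.transform(balls).select("prev_color", "prediction").show(5)
    spark.stop()
```

Calling main() from a script submitted with spark-submit would run the whole flow. The point of the sketch is that reading, feature preparation, and training all stay inside Spark, so the data never has to be exported from HDFS to a separate framework.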

Spark and Hadoop MapReduce

Hadoop was a breakthrough in big data processing when it came out. Though it is still a popular tool, Spark outperforms it in many areas, such as processing speed and near-real-time workloads.

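Spark keeps intermediate data in memory where it can, whereas MapReduce writes intermediate results to disk between steps, and this alone can be a game-changer.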
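To make the in-memory point concrete, here is a small PySpark sketch that runs two computations over the same dataset while reading it from storage only once. The function name summarize_balls and the color column are hypothetical, chosen to match the tennis ball example.

```python
def summarize_balls(spark, path):
    """Run two actions over the same data, reading it from storage only once."""
    # cache() is lazy: it only marks the DataFrame to be kept for reuse.
    balls = spark.read.csv(path, header=True).cache()

    # The first action reads the file and fills the cache.
    total = balls.count()

    # The second action is served from the cached rows instead of re-reading the file.
    rows = balls.groupBy("color").count().collect()
    per_color = {row["color"]: row["count"] for row in rows}

    # Release the cached data once it is no longer needed.
    balls.unpersist()
    return total, per_color
```

Without the cache() call, each action would go back to the source file. Iterative workloads such as machine learning training repeat this pattern many times, which is where keeping data in memory tends to pay off most.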

Applications of Spark

  • Trend calculations.
  • Personalized user experience.
  • Business intelligence (BI).
  • Summarizing a corpus using graph algorithms like TextRank with GraphX.
  • Real-time detection of fraudulent payments using Spark Streaming and MLlib.
  • Implementing an ETL pipeline with Spark Streaming.
