Intermediate
39 Lessons
3h 3min
Certificate of Completion
Takeaway Skills
A working knowledge of Apache Spark and the PySpark library for Python
A strong understanding of the advantages of using PySpark instead of Pandas for processing large datasets
The ability to calculate metrics and produce aggregated analytics reports
The ability to write production code in PySpark
Course Overview
Pandas is a popular Python library for manipulating data, but it has certain limitations in its ability to process large datasets. The Apache Spark analytics engine offers significant performance improvements. This course will help you improve your Python-based data processing by leveraging Apache Spark’s distributed, parallel processing capabilities through the PySpark library. You’ll start by reading data into a PySpark DataFrame before performing basic input/output functions, such as renaming attributes, selecting, and wr...
Course Content
Introduction
Data Input/Output
Data Transformation
User Defined Function (UDF)
Wrapping Up
Appendix
2 Lessons
How You'll Learn
You don’t get better at swimming by watching others. Coding is no different. Practice as you learn with live code environments inside your browser.
Videos are holding you back. Educative’s interactive, text-based lessons accelerate learning, with no setup, downloads, or alt-tabbing required.
Learn faster and smarter with adaptive AI tools embedded in every Educative course.
Built-in assessments let you test your skills. Completion certificates let you show them off.