

From Pandas to PySpark DataFrame

Learn how to scale Python data processing with PySpark. Read, transform, and aggregate data, and create user-defined functions that run on Apache Spark.

4.7

39 Lessons

3h 3min


LEARNING OBJECTIVES

A working knowledge of Apache Spark and the PySpark library for Python
A strong understanding of the advantages of using PySpark instead of Pandas for processing large datasets
The ability to calculate metrics and produce aggregated analytics reports
The ability to write production code in PySpark

Learning Roadmap

39 Lessons, 3 Quizzes

Introduction

Learn how to use PySpark for large-scale data processing and Amazon Review Data analysis.

Getting Started

Overview of Dataset

Data Input/Output

Walk through data input and output: reading data, renaming and selecting attributes, and saving results, followed by a quiz and a challenge.

Introduction to Data Input and Output

Read Data into DataFrame

Rename Attributes

Select a Subset of Attributes

Data Input and Output: Save a Snapshot

Read Parquet Data Source

Write Production Code

Quiz: Data Input and Output

Challenge: Data Input and Output

Solution: Data Input and Output

Data Transformation

16 Lessons

Work your way through transforming data, handling dates and times, imputing missing values, and evaluating reviews using pandas and PySpark.

User-Defined Function (UDF)

8 Lessons

Build a foundation in creating and using UDFs in PySpark for custom transformations.

Wrapping Up

Solve problems in PySpark and pandas with newly acquired foundational skills.

Conclusion

Appendix

2 Lessons

Focus on the Amazon Review Data (2018) and Pandas vs. PySpark performance.

Apriori Algorithm for Finding Frequent Itemsets with PySpark

Project


Certificate of Completion

Showcase your accomplishment by sharing your certificate of completion.

Developed by MAANG Engineers

ABOUT THIS COURSE

The pandas library is a popular Python tool for manipulating data, but it runs on a single machine and generally needs the dataset to fit in memory, which limits its ability to process large datasets. Apache Spark, a distributed analytics engine, can offer significant performance improvements on such workloads. This course will help improve your Python-based data processing by using Apache Spark’s parallel, distributed processing capabilities through the PySpark library. You’ll start by reading data into a PySpark DataFrame before performing basic input/output operations, such as renaming attributes, selecting, and writing data. You’ll move on to transformations such as aggregation, statistical analysis, and joins before creating custom user-defined functions. At each step, you’ll get a quick pandas review before being walked through the equivalent approach in PySpark. By the end of this course, you’ll be able to quickly and reliably process large amounts of data, even when it is stored across multiple files, using PySpark.

ABOUT THE AUTHOR

MrDataPsycho

Data science product developer, cloud-native data science advocate, and author.

