Pandas is a popular Python library for manipulating data, but it is limited in its ability to process large datasets. The Apache Spark analytics engine offers significant performance improvements on such data.

This course will help you improve your Python-based data processing by leveraging Apache Spark's parallel, distributed processing capabilities through the PySpark library. You'll start by reading data into a PySpark DataFrame before performing basic input/output operations, such as renaming attributes, selecting, and writing data. You'll move on to transformation functions like aggregation, statistical analysis, and joins before creating custom, user-defined functions. At each step, you'll get a quick Pandas review before being walked through the equivalent operation in PySpark.

By the end of this course, you'll be able to quickly and reliably process large amounts of data with PySpark, even when the data is stored across multiple files.

From Pandas to PySpark DataFrame

## DataFrames in global scope
The following code creates several small intermediate DataFrames in the **global scope**. Code like this should be converted into a series of functions so that we avoid polluting the global scope:

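```Python
from pyspark.sql import functions as fn
from pyspark.sql.functions import col

# main_df is assumed to be a PySpark DataFrame loaded earlier, with the
# columns asin, review_year, and review_month.

# Number of reviews per year and month
total_review_by_mth_df = (
    main_df
    .groupBy("review_year", "review_month")
    .agg(fn.count(col("asin")).alias("total_review"))
    .orderBy("review_year", "review_month")
)

# Each intermediate DataFrame below becomes another global variable
total_review_2016 = total_review_by_mth_df.filter(col("review_year") == 2016)
total_review_2017 = total_review_by_mth_df.filter(col("review_year") == 2017)

# Monthly review counts for 2016 and 2017, side by side
merged_20_16_17 = (
    total_review_2016
    .select(
        "review_month",
        col("total_review").alias("total_review_2016"),
    )
    .join(
        total_review_2017.select(
            "review_month",
            col("total_review").alias("total_review_2017"),
        ),
        on="review_month",
    )
)
merged_20_16_17.show()
```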

## Good practice in a production environment

In a production environment, the aggregation and subsetting of a DataFrame can be done through a **chain of function calls**, so that intermediate DataFrames stay local to the functions instead of living in the global scope.

