
What is PySpark and how to use it?


What is PySpark?

PySpark is an interface for Apache Spark in Python. It lets you write Spark applications using Python APIs (application programming interfaces).

It also provides the PySpark shell for interactively analyzing your data in a distributed environment.

Spark’s Features


1. Spark SQL & DataFrame

This Spark module is for structured data processing. It provides a programming abstraction called a DataFrame and can also act as a distributed SQL query engine.
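
To see the idea, here is a minimal sketch that builds a small DataFrame and queries it with Spark SQL. It assumes a SparkSession is available as spark (as it is on Databricks), and the column names and values are made up for illustration:

# Minimal sketch: build a small DataFrame and query it with Spark SQL.
# Assumes a SparkSession is available as `spark` (as on Databricks).
df = spark.createDataFrame(
    [("jazz", 340), ("hip hop", 210), ("jazz", 290)],
    ["genre", "duration"],
)
df.createOrReplaceTempView("tracks")
spark.sql("select genre, avg(duration) as avg_duration from tracks group by genre").show()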

2. Streaming

Running on top of Spark, the streaming feature in Apache Spark enables powerful interactive and analytical applications over both streaming and historical data, while inheriting Spark’s ease of use and fault-tolerance characteristics.
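
As a rough illustration, here is a minimal Structured Streaming sketch. It assumes a SparkSession named spark and uses Spark’s built-in rate source, which simply generates test rows, to print a running count to the console:

# Minimal Structured Streaming sketch, assuming a SparkSession named `spark`.
# The built-in "rate" source generates test rows of (timestamp, value).
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
query = (stream_df.groupBy().count()      # running count over the stream
         .writeStream
         .outputMode("complete")          # emit the full aggregate on each trigger
         .format("console")
         .start())
# query.awaitTermination()  # uncomment to keep the stream running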

3. MLlib

Built on top of Spark, MLlib is a scalable machine learning library that provides a uniform set of high-level APIs to help users create and tune practical machine learning pipelines.
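
For a flavor of the API, here is a minimal pipeline sketch. It assumes a SparkSession named spark, and the toy data and column names are made up for illustration:

# Minimal MLlib pipeline sketch: assemble feature columns and fit a logistic regression.
# Assumes a SparkSession named `spark`; the toy data is made up for illustration.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

training = spark.createDataFrame(
    [(0.0, 1.0, 0.1), (1.0, 0.0, 2.3), (0.0, 1.2, 0.4), (1.0, 0.1, 1.9)],
    ["label", "f1", "f2"],
)
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(maxIter=10)
model = Pipeline(stages=[assembler, lr]).fit(training)
model.transform(training).select("label", "prediction").show()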

4. Spark Core

Spark Core is the general execution engine on which all other functionality is built. It provides the Resilient Distributed Dataset (RDD) and in-memory computing capabilities.
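
As a quick sketch of the RDD API (assuming a SparkContext is available as sc, as it is in the Databricks example later in this answer):

# Minimal RDD sketch using the SparkContext, available as `sc` on Databricks.
rdd = sc.parallelize([1, 2, 3, 4, 5])      # distribute a local list
squares = rdd.map(lambda x: x * x)         # transformation (lazy)
print(squares.collect())                   # action: [1, 4, 9, 16, 25]
print(squares.reduce(lambda a, b: a + b))  # action: 55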

5. Databricks

Databricks is a company and a large-scale data processing platform founded by the creators of Apache Spark.

It was created for data scientists, engineers, and analysts, helping them unify data science, engineering, and business across the machine learning lifecycle.

PySpark example on Databricks

Part 1: Load & Transform Data

In the first stage, we load some distributed data, read it as an RDD (Resilient Distributed Dataset), apply some transformations to that RDD, construct a Spark DataFrame from it, and register it as a table.

1.1. List files

Files can be listed on a distributed file system (DBFS, S3, or HDFS) using %fs commands.

For this example, we use data files stored in DBFS (Databricks File System) at dbfs:/databricks-datasets/songs_pk/data_001.

DBFS is the layer over AWS S3 and the SSD drives attached to Spark clusters hosted in AWS.

When accessing a file, it first checks whether the file is cached on the SSD (solid-state drive). If it is not, it goes out to the specific S3 bucket to fetch the file(s).

%fs ls /databricks-datasets/songs_pk/data_001/

1.2. Display contents of the header

The list contains data files and one header file. Let’s use the PySpark textFile command to read the header file, then use collect to display its contents.

After executing this, you will see that each header entry consists of a field name and a type, separated by a colon:

sc.textFile("databricks-datasets/songs_pk/data_001/header.txt").collect()

1.3. Examine a data file

The textFile command is used to load the data files, and the take command is used to view the first three lines of the data.

After running this, you will see that each line consists of several fields separated by a tab (\t).


dataRDD = sc.textFile("/databricks-datasets/songs_pk/data_001/part-000*")
dataRDD.take(3)

1.4. Create a Python function to parse fields

Now we apply what we know about the data to parse it. To do this, we write a function that takes a line of text and returns an array of parsed fields, as shown in the code after this list.

  • If the header shows the type is int, we cast the token to an integer.

  • If the header shows the type is double, we cast the token to float.

  • Otherwise, we return the string.

# Split the header on its separator
header = sc.textFile("/databricks-datasets/songs_pk/data_001/header.txt") \
    .map(lambda line: line.split(":")).collect()

# Create the Python function to parse each line
def parse_line(line):
    tokens = zip(line.split("\t"), header)
    parsed_tokens = []
    for token in tokens:
        token_type = token[1][1]
        if token_type == 'double':
            parsed_tokens.append(float(token[0]))
        elif token_type == 'int':
            parsed_tokens.append(-1 if '-' in token[0] else int(token[0]))
        else:
            parsed_tokens.append(token[0])
    return parsed_tokens

1.5. Convert header structure

Before using the parsed header, we have to convert it to the types that Spark SQL expects. That means using SQL types:

  • IntegerType
  • DoubleType
  • StringType
  • StructType (instead of a normal Python list)

from pyspark.sql.types import *

def string_to_type(type_str):
    if type_str == 'int':
        return IntegerType()
    elif type_str == 'double':
        return DoubleType()
    else:
        return StringType()

schema = StructType([StructField(t[0], string_to_type(t[1]), True) for t in header])

1.6. Create a DataFrame

Spark’s createDataFrame() method is used to combine the schema and the parsed data to construct a DataFrame.

DataFrames are preferred because they are easier to manipulate and because Spark knows the data types, so it can process them more efficiently.

df = sqlContext.createDataFrame(dataRDD.map(parse_line), schema)

1.7. Create a temp table

Now that we have a DataFrame, we can register it as a temporary table, which lets us refer to it by name in SQL queries.

df.registerTempTable("songs_pkTable")

1.8. Cache the table

Since we will be accessing this data multiple times, let’s cache it in memory for faster subsequent access.

%sql cache table songs_pkTable

1.9. Query the data

We can now query our data using the temporary table we created and cached in memory. Since it is registered as a table, we can use SQL as well as the Spark API to access it.

%sql select * from songs_pkTable limit 15

Part 2: Explore & Visualize the Data

2.1. Display table schema

table("songs_pkTable").printSchema()

2.2. Get table rows count

%sql select count(*) from songs_pkTable

2.3. Visualize a data point: Song duration changes with time

An interesting question is: how do various attributes of songs change over time? For instance, how did average song duration change over the years?

We begin by importing ggplot, which makes plotting data easy in Python. Next, we put together the SQL query that pulls the required data from the table.

We convert the result to a pandas DataFrame with toPandas(), then use the display method to render the graph.

from ggplot import *
baseQuery = sqlContext.sql("select avg(duration) as duration, year from songs_pkTable group by year")
df_filtered = baseQuery.filter(baseQuery.year > 0).filter(baseQuery.year < 2010).toPandas()
plot = ggplot(df_filtered, aes('year', 'duration')) + geom_point() + geom_line(color='blue')
display(plot)
