Given a PySpark DataFrame, we can select columns based on a regex using the colRegex() method.
DataFrame.colRegex(colName: str)
colName: A string representing the column name specified as a regex.

```python
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('edpresso').getOrCreate()

data = [("James", "Smith", "USA", "CA"),
        ("Michael", "Rose", "USA", "NY"),
        ("Robert", "Williams", "USA", "CA"),
        ("Maria", "Jones", "USA", "FL")
]

columns = ["firstname", "lastname", "country", "state"]
df = spark.createDataFrame(data=data, schema=columns)
df.select(df.colRegex("`.+name$`")).show()
```
Note: If you see any warnings in the output, please ignore them.
Lines 1–2: We import pyspark and SparkSession.
Line 4: We create a SparkSession with the application name edpresso.
Lines 6–10: We define the dummy data for the DataFrame.
Line 12: We define the column names for the dummy data.
Line 13: We create a Spark DataFrame from the dummy data in lines 6–10 and the columns defined in line 12.
Line 14: We select a subset of the columns using a regex. The pattern .+name$ matches column names ending with "name", so the colRegex() method retrieves the firstname and lastname columns.
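To check which column names a pattern will match before running Spark, we can test it with Python's re module. This is a minimal sketch: colRegex() itself uses Java regex syntax, which behaves the same as Python's for a simple pattern like this one.

```python
import re

# The pattern passed to colRegex, without the backticks that Spark uses
# to mark the string as a regex rather than a literal column name.
pattern = re.compile(r".+name$")

columns = ["firstname", "lastname", "country", "state"]

# re.match anchors at the start and $ anchors at the end, so this
# mirrors matching the pattern against each full column name.
matched = [c for c in columns if pattern.match(c)]
print(matched)  # ['firstname', 'lastname']
```

This confirms that only the columns ending in "name" are selected, matching the two-column result of the show() call above.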