How to determine the number of unique words in a file in PySpark
Problem Statement
If we have a text file, how do we find the total number of unique words in it?
Algorithm
The assumption here is that a single space separates the words.
The steps involved are as follows:
- Read the text file into memory.
- Split the text file into individual tokens (or words).
- Find the unique tokens and the count of unique tokens.
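Before turning to Spark, the steps above can be sketched in plain Python — a minimal, Spark-free sketch assuming the text fits in memory and words are separated by single spaces (the sample sentence here is made up for illustration):

```python
# Hypothetical in-memory text standing in for the contents of a file.
text = "the quick brown fox jumps over the lazy dog the end"

# Steps 1-2: split each line into individual tokens on single spaces.
words = [w for line in text.splitlines() for w in line.split(' ')]

# Step 3: a set keeps only the unique tokens; its size is the count.
unique_words = set(words)
print(unique_words)
print(len(unique_words))
```

Spark's `distinct()` and `count()` play the same roles as the `set` and `len()` here, but distribute the work across partitions.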
Code
main.py
word_count.txt
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Educative_Answers').getOrCreate()

sc = spark.sparkContext

f_path = "word_count.txt"

f_rdd = sc.textFile(f_path)

words_rdd = f_rdd.flatMap(lambda line: line.split(' '))

distinct_words_rdd = words_rdd.distinct()

print("The unique words in the file are as follows:", distinct_words_rdd.collect())

count = distinct_words_rdd.count()

print("The count of unique words in the file is:", count)
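One caveat with the single-space assumption: `str.split(' ')` keeps empty strings when spaces repeat, and those empty strings would survive `distinct()` as an extra "word". A small illustration in plain Python (the same behavior applies inside the `flatMap()` lambda above):

```python
# str.split(' ') produces an empty string for each run of extra spaces,
# so a double space would add "" as a spurious "unique word".
line = "hello  world"
print(line.split(' '))  # includes an empty string token
print(line.split())     # split() with no argument drops empty tokens
```

If the input may contain irregular whitespace, passing `lambda line: line.split()` to `flatMap()` avoids counting the empty string.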
Explanation
- Lines 1–2: Import pyspark and SparkSession.
- Line 4: We create a SparkSession with the application name Educative_Answers.
- Line 6: The Spark context object is assigned to the variable sc.
- Line 8: The path to the text file is defined.
- Line 10: The file is read into Spark as an RDD (Resilient Distributed Dataset) using the textFile() method.
- Line 12: We apply flatMap() on the RDD to split the file into tokens. We pass a lambda function that takes a line as input and splits it into words, with a space as the delimiter, using the split() method. The resulting RDD contains the individual words of the text file.
- Line 14: The unique words are found by invoking the distinct() function on the RDD.
- Line 16: The unique words in the text file are printed.
- Line 18: The count of unique words is obtained by invoking the count() function.
- Line 20: The count of unique words is printed.
Copyright ©2026 Educative, Inc. All rights reserved