How to determine the number of unique words in a file in PySpark
Problem Statement
If we have a text file, how do we find the total number of unique words in it?
Algorithm
The assumption here is that a single space separates the words.
The steps involved are as follows:
- Read the text file into memory.
- Split the text file into individual tokens (or words).
- Find the unique tokens and the count of unique tokens.
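Before turning to Spark, the steps above can be sketched in plain Python — a minimal, Spark-free sketch assuming the text fits in memory and words are separated by single spaces (the sample sentence here is made up for illustration):

```python
# Hypothetical in-memory text standing in for the contents of a file.
text = "the quick brown fox jumps over the lazy dog the end"

# Steps 1-2: split each line into individual tokens on single spaces.
words = [w for line in text.splitlines() for w in line.split(' ')]

# Step 3: a set keeps only the unique tokens; its size is the count.
unique_words = set(words)
print(unique_words)
print(len(unique_words))
```

Spark's `distinct()` and `count()` play the same roles as the `set` and `len()` here, but distribute the work across partitions.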
Code
main.py
word_count.txt
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Educative_Answers').getOrCreate()

sc = spark.sparkContext

f_path = "word_count.txt"

f_rdd = sc.textFile(f_path)

words_rdd = f_rdd.flatMap(lambda line: line.split(' '))

distinct_words_rdd = words_rdd.distinct()

print("The unique words in the file are as follows:", distinct_words_rdd.collect())

count = distinct_words_rdd.count()

print("The count of unique words in the file is:", count)
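One caveat with the single-space assumption: `str.split(' ')` keeps empty strings when spaces repeat, and those empty strings would survive `distinct()` as an extra "word". A small illustration in plain Python (the same behavior applies inside the `flatMap()` lambda above):

```python
# str.split(' ') produces an empty string for each run of extra spaces,
# so a double space would add "" as a spurious "unique word".
line = "hello  world"
print(line.split(' '))  # includes an empty string token
print(line.split())     # split() with no argument drops empty tokens
```

If the input may contain irregular whitespace, passing `lambda line: line.split()` to `flatMap()` avoids counting the empty string.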
Explanation
- Lines 1–2: Import pyspark and SparkSession.
- Line 4: We create a SparkSession with the application name Educative_Answers.
- Line 6: The Spark context object is assigned to the variable sc.
- Line 8: The path to the text file is defined.
- Line 10: The file is read into Spark as an RDD (Resilient Distributed Dataset) using the textFile() method.
- Line 12: We apply flatMap() on the RDD to split the file into tokens. We pass a lambda function that takes a line as input and splits it into words, with a space as the delimiter, using the split() method. The resulting RDD contains the individual words of the text file.
- Line 14: The unique words are found by invoking the distinct() function on the RDD.
- Line 16: The unique words in the text file are printed.
- Line 18: The count of unique words is obtained by invoking the count() function.
- Line 20: The count of unique words is printed.
Copyright ©2026 Educative, Inc. All rights reserved