
Finding the most frequent words by Shakespeare

Explore advanced Bash techniques for mining Shakespearean plays and poems. Learn how to extract, combine, and sort text data to identify the most frequent words across multiple works by Shakespeare using shell scripting.


Given a text, what are the most frequent words?

Finding the most frequent words in a given text (e.g., Knight_of_the_Burning_Pestle) is easy: we can build a function, toptokens(), which is essentially the topcrimes() function developed in our previous project. Let’s watch the following video lecture first:

Video lecture: Finding the most frequent words by Shakespeare (complex)

For example, if we want to grab the most frequent words in the Romeo and Juliet play, we can execute the following:

Shell
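# toptokens FILE COLUMN: list the 20 highest-count tokens in COLUMN of FILE.
toptokens() { csvcut -c "tokens,$2" "$1" | \
sort -nr -t "," -k 2 | \
head -n 20 | \
awk -F',' '{print $1 "," $2}' ; }
# Rank the tokens of one work and render the result as a table.
toptokens plays_and_poems_stat.csv "Romeo_and_Juliet___play___Shakespeare" | csvlook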
The top 20 frequent words in the work "Romeo and Juliet"
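
To see the select, sort, and truncate steps in isolation, here is a minimal sketch that uses only awk, sort, and head instead of csvkit. The file sample_stat.csv, its column names, and the function toptokens_plain are hypothetical stand-ins invented for illustration. The sketch also assumes a simple CSV with no quoted or embedded commas.

```shell
# Hypothetical sample data: a tiny stand-in for plays_and_poems_stat.csv.
cat > sample_stat.csv <<'EOF'
tokens,Work_A___play___Author_X,Work_B___poem___Author_X
the,12,7
love,9,15
night,3,0
sword,5,1
EOF

# toptokens_plain FILE COLUMN [N]: print the N highest counts in COLUMN
# (N defaults to 20). Hypothetical helper, not part of the lesson's code.
toptokens_plain() {
  awk -F',' -v col="$2" '
    # Header row: find the index of the requested column, then skip the row.
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
    # Data rows: emit "token,count" only if the column was found.
    c       { print $1 "," $c }
  ' "$1" | sort -t ',' -k 2,2nr | head -n "${3:-20}"
}

toptokens_plain sample_stat.csv "Work_A___play___Author_X" 3
```

On this sample the call prints the,12 then love,9 then sword,5. Because the header row is consumed inside awk, it never takes part in the numeric sort.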

Given an author, what are the most frequent words?

This is slightly more complicated because we again need to perform several steps:

  • For the given author, trim out the names of the plays and poems,
...