Overview of PyCaret's NLP module
A module is a building block for creating experiments. This shot provides an overview of the NLP module available in PyCaret. PyCaret has different modules under different categories.
Every module offers specific machine learning algorithms, along with common functions whose names and behavior are shared across modules. For example, the create_model function trains and evaluates a model in every module.
Under supervised machine learning, PyCaret has the following modules for:
- Classification
- Regression
Under unsupervised machine learning, PyCaret has the following modules for:
- Clustering
- Anomaly detection
- Natural language processing
- Association rule mining
There are also time series and dataset modules in PyCaret.
Depending on our use case, we can use any of these PyCaret modules. A flow chart of these modules is given below:
We will only discuss the Natural Language Processing (NLP) module here. The sections that we will cover are:
- Importing necessary libraries
- Importing the dataset
- Conducting exploratory data analysis
Importing necessary libraries
First, we will import the necessary libraries required for the NLP module.
from pycaret import nlp
from pycaret import classification
from wordcloud import WordCloud
from spacy.lang.en.stop_words import STOP_WORDS
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 300
Loading the dataset
After importing the libraries, we will use the read_csv() function to load the required dataset. Then, we will use the head() function to print the first ten rows of the dataset.
data = pd.read_csv('bbc-text.csv')
data.head(10)
The dataset information is printed through the info() function.
data.info()
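To make this inspection step concrete, here is a small self-contained sketch that mimics the assumed schema of bbc-text.csv (a category column and a text column); the sample rows below are invented for illustration only:

```python
import pandas as pd

# Tiny stand-in for bbc-text.csv; the (category, text) schema is assumed,
# and these rows are invented for illustration only.
sample = pd.DataFrame({
    'category': ['tech', 'sport', 'tech'],
    'text': ['new chip released', 'team wins final', 'phone sales rise'],
})

print(sample.head(2))   # first two rows, like data.head(10) above
sample.info()           # column names, non-null counts, and dtypes
```

The info() output confirms each column's dtype and whether any values are missing before we start the analysis.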
Exploratory data analysis
We will now perform an exploratory data analysis on the dataset we imported earlier.
Bar chart
As we can see in the output below, the bar chart of the data is plotted:
colour = ['C5', 'C6', 'C7', 'C8', 'C9']

category = data['category'].value_counts()
category.plot(kind = 'bar', figsize = (10,7), color = colour)

plt.show()
Code explanation
- Line 1: We create a list of colors, one for every category.
- Line 3: We count the number of instances in each category.
- Line 4: We plot these counts in a bar chart, using the plot() function.
- Line 6: We display the plot, using the plt.show() function.
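The counting step relies on pandas' value_counts(), which tallies how often each label occurs in a column. A quick sketch with invented labels:

```python
import pandas as pd

# value_counts() returns per-label frequencies, sorted descending by default.
labels = pd.Series(['tech', 'sport', 'tech', 'politics', 'tech'])
counts = labels.value_counts()
print(counts['tech'])  # → 3
```

Because the result is itself a Series, it can be passed straight to plot(kind='bar') as in the code above.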
WordCloud
We draw the WordCloud of the data. A WordCloud is a collection of many words depicted in different sizes. The bigger and bolder a word appears in the WordCloud, the more frequently it occurs in the text and the more important it is.
wcloud = WordCloud(width = 1600, height = 1000, stopwords = STOP_WORDS,
                   background_color = 'white', min_word_length = 3, max_words = 120)

tech = data.query(" category == 'tech' ")['text']
text = ' '.join(tech.to_list())
wcloud_img = wcloud.generate(text)

plt.figure(figsize = (12,8))
plt.imshow(wcloud_img, interpolation = 'bilinear')
plt.axis("off")
plt.show()
Code explanation
- Lines 1–2: We use the WordCloud() function to create a WordCloud from a text.
- Lines 4–6: We generate the WordCloud from the tech category's text, using the generate() function.
- Lines 8–11: We display a plot of the WordCloud that we generated.
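Before sizing words, WordCloud drops stop words such as "the" and "is". The effect can be sketched with the standard library; the small stop-word set below is a toy stand-in for spaCy's STOP_WORDS:

```python
from collections import Counter

# Toy stop-word set standing in for spaCy's STOP_WORDS.
stop_words = {'the', 'is', 'a', 'of', 'and'}

text = 'the cloud is a picture of the words and the words vary in size'
tokens = [w for w in text.split() if w not in stop_words]
freq = Counter(tokens)
print(freq.most_common(1))  # → [('words', 2)]
```

Only the surviving content words are counted, so filler words never dominate the cloud.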
Horizontal bar chart
As we can see in the output below, the horizontal bar chart of the data is plotted:
txt = ' '.join(data['text'].to_list())
frequency = wcloud.process_text(txt)
df_frequency = pd.DataFrame.from_dict(frequency, orient='index', columns=['frequency'])
df_frequency = df_frequency.sort_values('frequency')
df_frequency[-20:].plot(kind = 'barh', figsize = (10,8))

plt.show()
Code explanation
- Lines 1–4: We count the frequency of words from the dictionary made from the text, and sort the frequencies in ascending order.
- Line 5: We plot the twenty most frequent words in a horizontal bar chart.
- Line 7: We display the horizontal plot, using the plt.show() function.
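The wcloud.process_text() call used above tokenizes the text and returns a word-to-frequency dictionary. The same idea can be sketched with the standard library; this simplified stand-in omits WordCloud's stop-word and word-length filtering:

```python
import re
from collections import Counter

def simple_process_text(text):
    # Lowercase, split on runs of letters, and count occurrences:
    # a simplified stand-in for WordCloud.process_text().
    words = re.findall(r'[a-z]+', text.lower())
    return dict(Counter(words))

freq = simple_process_text('Tech news: tech stocks rise as tech firms report')
print(freq['tech'])  # → 3
```

The resulting dictionary is exactly the shape that pd.DataFrame.from_dict(..., orient='index') expects in the chart code above.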