Trusted answers to developer questions

Overview of PyCaret's NLP module

Educative Team

A module is a building block for creating experiments in PyCaret. This shot provides an overview of PyCaret's NLP module. PyCaret groups its modules into different categories.

Each module offers its own machine learning algorithms, along with functions that are shared across modules. For example, the create_model function trains and evaluates a model in every module.

Under supervised machine learning, PyCaret has modules for:

  • Classification
  • Regression

Under unsupervised machine learning, PyCaret has modules for:

  • Clustering
  • Anomaly detection
  • Natural language processing
  • Association rule mining

There are also time series and dataset modules in PyCaret.

Depending on our use case, we can use any of these PyCaret modules.

[Figure: flow chart of the PyCaret modules]

We will only discuss the Natural Language Processing (NLP) module here. The sections that we will cover are:

  1. Importing necessary libraries
  2. Loading the dataset
  3. Conducting exploratory data analysis

Importing necessary libraries

First, we import the libraries required for the NLP module.

from pycaret import nlp  # available in PyCaret 2.x; the NLP module was removed in PyCaret 3.0
from pycaret import classification
from wordcloud import WordCloud
from spacy.lang.en.stop_words import STOP_WORDS
import matplotlib.pyplot as plt
import pandas as pd 
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 300
Importing libraries

Loading the dataset

After importing the libraries, we use the read_csv() function to load the dataset. Then, we use the head() function to print the first ten instances of the dataset.

data = pd.read_csv('bbc-text.csv')
print(data.head(10))  # head() returns five rows by default, so we ask for ten

Load the dataset

The dataset information, such as the column names, non-null counts, and data types, is printed through the info() function.

data.info()
Dataset info
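
As a minimal sketch of the loading and inspection steps, the snippet below reads a small in-memory CSV instead of bbc-text.csv, so that it runs without the file. The sample rows and the load_sample() helper are made up for illustration. Only the column names, category and text, are taken from the code used later in this shot.

```python
import io
import pandas as pd

# Hypothetical stand-in for bbc-text.csv: same column names, made-up rows.
SAMPLE_CSV = """category,text
tech,new phone launches with faster chip
sport,home team wins the cup final
tech,software update fixes security bug
business,shares rise after strong earnings
"""

def load_sample():
    """Read the sample CSV into a DataFrame, as read_csv() would for a file path."""
    return pd.read_csv(io.StringIO(SAMPLE_CSV))

sample = load_sample()
print(sample.head(10))  # head(n) shows at most the first n rows
sample.info()           # prints column names, non-null counts, and dtypes
```

Because the sample has only four rows, head(10) returns all of them.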

Exploratory data analysis

We will now perform an exploratory data analysis on the dataset we imported earlier.

Exploratory data analysis (EDA) is a method that uses descriptive statistics and visualization to help us understand the properties of a particular dataset. It is an important part of every machine learning or data science project, as we have to understand the dataset before we can utilize it.

Bar chart

The following code plots a bar chart of the number of instances in each category:

colour = ['C5', 'C6', 'C7', 'C8', 'C9']

category = data['category'].value_counts()
category.plot(kind = 'bar', figsize = (10,7), color = colour)

plt.show()
Bar chart

Code explanation

  • Line 1: We create a list of colors, one for every category.

  • Line 3: We count the number of instances in each category, using the value_counts() function.

  • Line 4: We plot these counts in a bar chart, using the plot() function.

  • Line 6: We display the plot, using the show() function.
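
To make the counting step concrete, here is a minimal sketch on a made-up DataFrame. The rows and the category_counts() helper are hypothetical. Only the category column name comes from the code above.

```python
import pandas as pd

# Made-up rows for illustration; the column name follows the shot's dataset.
sample = pd.DataFrame({
    'category': ['tech', 'sport', 'tech', 'business', 'sport', 'tech'],
})

def category_counts(df):
    """Return the number of rows per category, most frequent first."""
    return df['category'].value_counts()

counts = category_counts(sample)
print(counts)
```

Calling counts.plot(kind = 'bar') on this result would draw one bar per category, with the bar height equal to the count.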


Word cloud

Next, we draw a word cloud of the data. A word cloud is a collection of words depicted in different sizes. The bigger and bolder a word appears in the word cloud, the more frequently it occurs in the text.

wcloud = WordCloud(width = 1600, height = 1000, stopwords = STOP_WORDS,
     background_color = 'white', min_word_length = 3, max_words = 120)

tech_text = data.query("category == 'tech'")['text']  # keep data intact for later steps
text = ' '.join(tech_text.to_list())
wcloud_img = wcloud.generate(text)

plt.figure(figsize = (12,8))
plt.imshow(wcloud_img, interpolation = 'bilinear')
plt.axis('off')
plt.show()

Code explanation

  • Lines 1–2: We create a WordCloud object that excludes stop words and words shorter than three characters, and shows at most 120 words.

  • Lines 4–6: We select the text of the instances in the tech category, join it into a single string, and generate the word cloud using the generate() function.

  • Lines 8–11: We display a plot of the word cloud that we generated.
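
The word sizes in a word cloud are driven by word frequencies. The sketch below shows that idea with the standard library only. It is a simplified stand-in, not WordCloud's own tokenization, and the STOP set and the word_frequencies() helper are assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set; the shot uses spaCy's STOP_WORDS instead.
STOP = {'the', 'and', 'with', 'for', 'that', 'this'}

def word_frequencies(text, stopwords=STOP, min_word_length=3):
    """Count lowercase words, skipping stop words and very short words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(
        w for w in words
        if w not in stopwords and len(w) >= min_word_length
    )

freq = word_frequencies("The phone and the tablet: this phone is for the phone fans")
print(freq.most_common(3))
```

In this example, "phone" has the highest count, so it would be drawn largest. Stop words such as "the" and short words such as "is" are left out entirely.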

Horizontal bar chart

The following code plots a horizontal bar chart of the 20 most frequent words in the text:

txt = ' '.join(data['text'].to_list())
frequency = wcloud.process_text(txt)
df_frequency = pd.DataFrame.from_dict(frequency, orient='index', columns=['frequency'])
df_frequency = df_frequency.sort_values('frequency')
df_frequency[-20:].plot(kind = 'barh', figsize = (10,8))

plt.show()
Horizontal bar chart

Code explanation

  • Lines 1–4: We join the text into a single string, count the word frequencies with the process_text() function, which returns a dictionary, load that dictionary into a DataFrame, and sort it by frequency.

  • Line 5: We plot the 20 highest frequencies in a horizontal bar chart.

  • Line 7: We display the horizontal plot, using the show() function.
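
The dictionary-to-DataFrame step can be tried in isolation. This is a minimal sketch with hypothetical word counts standing in for the dictionary returned by process_text(). The top_words() helper is also an assumption for illustration.

```python
import pandas as pd

# Hypothetical word counts, standing in for the dictionary from process_text().
frequency = {'phone': 7, 'game': 3, 'market': 5, 'film': 1}

def top_words(freq, n=20):
    """Return the n most frequent words as a DataFrame sorted in ascending order."""
    df = pd.DataFrame.from_dict(freq, orient='index', columns=['frequency'])
    return df.sort_values('frequency').iloc[-n:]

top = top_words(frequency, n=3)
print(top)
```

Sorting in ascending order and taking the last n rows puts the most frequent word in the last row, which a horizontal bar chart draws at the top.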

The remaining sections can be explored with the help of the official documentation for the NLP module:

  • Create model
  • Tune model
  • Assign model
  • Evaluate model
  • Save model


Copyright ©2022 Educative, Inc. All rights reserved
