
Initializing the NLP Environment

Explore how to initialize the NLP environment in PyCaret to effectively preprocess text data. Understand essential steps such as removing numeric characters and stopwords, tokenizing words, extracting bigrams and trigrams, and applying lemmatization to prepare your dataset for natural language processing tasks.

Now we’ll initialize the PyCaret NLP environment and create the transformation pipeline by using the setup() function. The target parameter lets us specify the dataset’s text column, which will go through a number of preprocessing steps as described below. After this process is completed, the first 10 instances of the preprocessed dataset are printed.

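# Initializing the NLP environment
from pycaret import nlp  # assumes PyCaret 2.x, which ships the nlp module

# Build the preprocessing pipeline on the 'text' column; session_id fixes the random seed
nlp_ = nlp.setup(data=data, target='text', session_id=6842)

# Retrieve the preprocessed dataset and show its first 10 rows
data_ = nlp.get_config('data_')
data_.head(10)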

Initializing the NLP environment

Numeric and special character removal

Numbers and punctuation are usually not informative in this kind of text analysis, so PyCaret removes all numeric and special characters from the corpus. It uses regular expressions to replace these characters with spaces.
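To make the idea concrete, here is a minimal sketch of regex-based character removal using only Python's standard re module. It is not PyCaret's internal implementation: the function name strip_numeric_and_special, the character class, and the whitespace-collapsing step are assumptions chosen for illustration.

```python
import re

def strip_numeric_and_special(text: str) -> str:
    """Illustrative helper (hypothetical, not part of PyCaret)."""
    # Replace every character that is not an ASCII letter or whitespace with a space.
    cleaned = re.sub(r"[^A-Za-z\s]", " ", text)
    # Collapse the resulting runs of whitespace into single spaces (an extra tidy-up step).
    return re.sub(r"\s+", " ", cleaned).strip()

sample = "Loan #42 was repaid in 2019, 100% on time!"
print(strip_numeric_and_special(sample))  # Loan was repaid in on time
```

Replacing the characters with spaces rather than deleting them keeps neighboring words from being fused together, for example "on-time" becomes "on time" instead of "ontime".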

Word tokenization

Tokenization is the process of splitting the corpus into tokens, smaller units that are usually words. This is fundamental and typically one of the first steps in NLP because it ...