Text Preprocessing and Sentiment Analysis
Learn how to process text data using TensorFlow to train a JAX model.
Text vectorization
First, we use scikit-learn’s `TfidfVectorizer` class to convert the text data to numerical TF-IDF representations. The class accepts the maximum number of features to keep in the vocabulary.
In the code above:

- Lines 1–3: We import the required modules: `jax.numpy` as `jnp`, `TfidfVectorizer` from `sklearn.feature_extraction.text`, and `train_test_split` from `sklearn.model_selection`.
- Line 4: We create an instance of the `TfidfVectorizer` class with `max_features` set to `10000`.
- Line 5: We call the `fit_transform()` method of the `TfidfVectorizer` class to convert `docs` into TF-IDF values. We also call the `toarray()` method to convert these TF-IDF values into an array and store it in the `X` variable.
- Line 6: We use the `train_test_split()` function to split the dataset (`X` and `labels`) into train (`X_train` and `y_train`) and test (`X_test` and `y_test`) datasets.
- Lines 9–10: We convert `X_train` and `X_test` into JAX arrays.
- Lines 12–13: We print `X_train` and `X_test`.
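The steps above can be sketched as follows. The toy `docs` and `labels` below are hypothetical stand-ins for the lesson's dataset, and the split parameters are illustrative:

```python
import jax.numpy as jnp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus and sentiment labels (1 = positive, 0 = negative).
docs = ["great movie", "terrible film", "loved it", "hated it"]
labels = [1, 0, 1, 0]

# Convert the documents into TF-IDF features, capped at 10000 terms.
vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(docs).toarray()

# Split into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0
)

# Convert the features into JAX arrays for use with a JAX model.
X_train = jnp.array(X_train)
X_test = jnp.array(X_test)

print(X_train)
print(X_test)
```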
Next, we use TensorFlow’s `TextVectorization` layer to convert the text data to integer representations. The layer accepts the following arguments:
- We use `standardize` to specify how the text data is processed. For example, the `lower_and_strip_punctuation` option lowercases the data and removes punctuation.
- We use `max_tokens` to dictate the maximum size of the vocabulary.
- We use `output_mode` to determine the output of the vectorization layer. The `int` setting outputs integers.
- We use `output_sequence_length` to indicate the maximum length of the output sequence. This ensures that all sequences have the same length.
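A minimal sketch of a `TextVectorization` layer configured with these arguments might look like this; the sample sentences and parameter values are placeholders, not the lesson's actual data:

```python
import tensorflow as tf

# Create the vectorization layer with the arguments described above.
vectorize_layer = tf.keras.layers.TextVectorization(
    standardize="lower_and_strip_punctuation",
    max_tokens=10000,
    output_mode="int",
    output_sequence_length=50,
)

# Hypothetical sample sentences used to build the vocabulary.
docs = ["Great movie, loved it!", "Terrible film."]
vectorize_layer.adapt(docs)

# Each sentence becomes a fixed-length sequence of 50 token IDs,
# padded with zeros where the sentence is shorter.
vectorized = vectorize_layer(docs)
print(vectorized.shape)  # (2, 50)
```

Because `output_sequence_length` is set, short sentences are zero-padded to the same length, which keeps batch shapes uniform.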
Preparing training and testing data
Next, we apply this layer to the training and testing data.
Let’s convert the data to a TensorFlow dataset and create a function to fetch the data in batches. We’ll also convert the data to NumPy arrays because JAX expects NumPy or JAX arrays. Here, tfds.dataset_as_numpy converts the tf.data.Dataset into an iterable of NumPy arrays.
In the code above:

- Line 1: We import `tensorflow_datasets` as `tfds`.
- Lines 3–4: We call the `from_tensor_slices()` method of the `tf.data.Dataset` module to create the TensorFlow `Dataset` objects for the training and testing datasets. We name the test data as validation data.
- Lines 5–6: We call ...