
Generating Data for GloVe

Explore how to generate data for the GloVe word embedding model using the BBC news articles dataset. Understand the process of creating batches of (target, context) word pairs, computing log co-occurrence values, and sample weights to feed into GloVe training. Learn to shuffle and batch data effectively to build robust vector representations for NLP tasks.

We’ll be using the BBC news articles dataset. It contains 2,225 news articles published on the BBC website between 2004 and 2005, belonging to five topics: business, entertainment, politics, sports, and tech.

The glove_data_generator() function

Let’s now generate the data. We’ll encapsulate the data generation in a function called glove_data_generator(). As the first step, let’s write the function signature:

```python
def glove_data_generator(
    sequences, window_size, batch_size, vocab_size,
    cooccurrence_matrix, x_max=100.0, alpha=0.75, seed=None
):
```

The function takes several arguments:

  • sequences (List[List[int]]): This is a list of lists of word IDs. It is the output generated by the tokenizer’s texts_to_sequences() function.
  • window_size (int): This is the window size for the context.
  • batch_size (int): This is the batch size.
  • vocab_size (int): This is the vocabulary size.
  • cooccurrence_matrix (scipy.sparse.lil_matrix): This is a sparse matrix containing co-occurrences of words.
  • x_max (float): This is the hyperparameter used by GloVe to compute sample weights.
  • alpha (float): This is the hyperparameter used by GloVe to compute sample weights.
  • seed: This is the random seed.
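To make the role of x_max and alpha concrete: in the GloVe paper, the sample weight for a co-occurrence count X_ij is f(X_ij) = (X_ij / x_max)^alpha when X_ij < x_max, and 1 otherwise. A minimal sketch of this weighting (the function name glove_sample_weight is illustrative, not part of our generator):

```python
import numpy as np

def glove_sample_weight(x_ij, x_max=100.0, alpha=0.75):
    # GloVe weighting function: dampens rare co-occurrences and caps
    # the influence of very frequent ones at 1.0
    x = np.asarray(x_ij, dtype=np.float64)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)
```

For example, a co-occurrence count of 100 or more gets the full weight of 1.0, while smaller counts are scaled down smoothly.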

It also has several outputs:

  • A batch of (target, context) word ID tuples.
  • The corresponding log(X_{ij}) values.
...
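Putting the pieces together, here is one possible sketch of the generator body, under the assumption that cooccurrence_matrix supports 2-D indexing (as scipy.sparse.lil_matrix does); the internal variable names are illustrative:

```python
import numpy as np

def glove_data_generator(sequences, window_size, batch_size, vocab_size,
                         cooccurrence_matrix, x_max=100.0, alpha=0.75,
                         seed=None):
    rng = np.random.RandomState(seed)
    # Collect every (target, context) word ID pair within the window
    pairs = []
    for seq in sequences:
        for i, target in enumerate(seq):
            lo = max(0, i - window_size)
            hi = min(len(seq), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((target, seq[j]))
    pairs = np.array(pairs)
    rng.shuffle(pairs)  # shuffle rows so batches are well mixed
    # Yield batches of word pairs, log co-occurrences, and sample weights
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        x_ij = np.array([cooccurrence_matrix[t, c] for t, c in batch],
                        dtype=np.float64)
        x_ij = np.maximum(x_ij, 1.0)  # guard against log(0)
        log_x = np.log(x_ij)
        weights = np.where(x_ij < x_max, (x_ij / x_max) ** alpha, 1.0)
        yield batch[:, 0], batch[:, 1], log_x, weights
```

This is a sketch rather than the lesson’s exact implementation; in particular, clamping counts to at least 1 before taking the log is one simple way to avoid log(0), and the shuffling strategy here operates over all pairs in memory.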