
Generating Data for GloVe

Explore how to generate data for the GloVe word embedding model using the BBC news articles dataset. Understand the process of creating batches of (target, context) word pairs, computing log co-occurrence values, and sample weights to feed into GloVe training. Learn to shuffle and batch data effectively to build robust vector representations for NLP tasks.

We’ll be using the BBC news articles dataset. It contains 2,225 news articles published on the BBC website between 2004 and 2005, belonging to five topics: business, entertainment, politics, sports, and tech.

The glove_data_generator() function

Let’s now generate the data. We’ll encapsulate the data generation in a function called glove_data_generator(). As the first step, let’s write the function signature:

```python
def glove_data_generator(
    sequences, window_size, batch_size, vocab_size,
    cooccurrence_matrix, x_max=100.0, alpha=0.75, seed=None
):
```

The function takes several arguments:

  • sequences (List[List[int]]): This is a list of lists of word IDs. It is the output generated by the tokenizer’s texts_to_sequences() function.
  • window_size (int): This is the window size for the context.
  • batch_size (int): This is the batch size.
  • vocab_size (int): This is the vocabulary size.
  • cooccurrence_matrix (scipy.sparse.lil_matrix): This is a sparse matrix containing co-occurrences of words.
  • x_max (float): This is the hyperparameter used by GloVe to compute sample weights.
  • alpha (float): This is the hyperparameter used by GloVe to compute sample weights.
  • seed: This is the random seed.
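To make the role of x_max and alpha concrete: in the GloVe paper, the sample weight for a co-occurrence count X_ij is f(X_ij) = (X_ij / x_max)^alpha when X_ij < x_max, and 1 otherwise. A minimal sketch of this weighting (the function name glove_sample_weight is illustrative, not part of our generator):

```python
import numpy as np

def glove_sample_weight(x_ij, x_max=100.0, alpha=0.75):
    # GloVe weighting function: dampens rare co-occurrences and caps
    # the influence of very frequent ones at 1.0
    x = np.asarray(x_ij, dtype=np.float64)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)
```

For example, a co-occurrence count of 100 or more gets the full weight of 1.0, while smaller counts are scaled down smoothly.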

It also has several outputs:

  • A batch of (target, context) word ID tuples.
  • The corresponding log(X_{ij}) values.
...
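Putting the pieces together, here is one possible sketch of the generator body, under the assumption that cooccurrence_matrix supports 2-D indexing (as scipy.sparse.lil_matrix does); the internal variable names are illustrative:

```python
import numpy as np

def glove_data_generator(sequences, window_size, batch_size, vocab_size,
                         cooccurrence_matrix, x_max=100.0, alpha=0.75,
                         seed=None):
    rng = np.random.RandomState(seed)
    # Collect every (target, context) word ID pair within the window
    pairs = []
    for seq in sequences:
        for i, target in enumerate(seq):
            lo = max(0, i - window_size)
            hi = min(len(seq), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((target, seq[j]))
    pairs = np.array(pairs)
    rng.shuffle(pairs)  # shuffle rows so batches are well mixed
    # Yield batches of word pairs, log co-occurrences, and sample weights
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        x_ij = np.array([cooccurrence_matrix[t, c] for t, c in batch],
                        dtype=np.float64)
        x_ij = np.maximum(x_ij, 1.0)  # guard against log(0)
        log_x = np.log(x_ij)
        weights = np.where(x_ij < x_max, (x_ij / x_max) ** alpha, 1.0)
        yield batch[:, 0], batch[:, 1], log_x, weights
```

This is a sketch rather than the lesson’s exact implementation; in particular, clamping counts to at least 1 before taking the log is one simple way to avoid log(0), and the shuffling strategy here operates over all pairs in memory.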