How to implement FastText using Gensim

FastText is a lightweight library developed by Facebook AI Research (FAIR) for efficient, scalable text representation and classification.

Notably, it operates efficiently on standard hardware, and its innovative features enable it to run even on smartphones and small computers by minimizing memory consumption. This capability to deliver powerful NLP functionalities while being resource-friendly distinguishes FastText as a versatile tool for a wide range of applications.

Gensim

Gensim is a free, open-source Python library designed to represent documents as semantic vectors efficiently and intuitively. It offers a range of algorithms, including Word2Vec, FastText, Latent Semantic Indexing (LSI), and Latent Dirichlet Allocation (LDA), among others. Gensim operates in an unsupervised manner, meaning it requires no labeled data or human annotation: a corpus of plain text documents is all that's needed to unlock the semantic richness within them.

Gensim stands out for its practicality, prioritizing proven, battle-tested algorithms tailored to solve real industry challenges. Its focus lies more on engineering solutions than on academic pursuits. Moreover, Gensim's memory independence enables it to process large-scale corpora efficiently, as it does not require the entire training corpus to reside in RAM at any given time. This makes it suitable for handling web-scale datasets using data streaming techniques.
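To see what this streaming looks like in practice, here is a minimal sketch of a memory-friendly corpus iterator. It assumes a hypothetical plain-text file named corpus.txt with one whitespace-tokenized sentence per line; Gensim's models can consume any restartable iterable like this, so the full corpus never has to sit in RAM:

from gensim.models import FastText

class CorpusStreamer:
    """Stream sentences one at a time instead of loading the whole file into memory."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopening the file on each call makes the iterable restartable,
        # which Gensim needs because it passes over the corpus several times.
        with open(self.path, encoding='utf-8') as f:
            for line in f:
                yield line.lower().split()  # naive whitespace tokenization

sentences = CorpusStreamer('corpus.txt')  # hypothetical file name
model = FastText(sentences, vector_size=100, window=5, min_count=5)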

Under the hood, it relies on highly optimized numerical routines implemented in C, on BLAS (Basic Linear Algebra Subprograms), and on memory-mapping techniques for fast, reliable processing, even with large datasets. It's a versatile tool trusted by experts to simplify and speed up the analysis of text data in natural language processing tasks.

Installation

To install Gensim, enter the following command:

pip install --upgrade gensim

This command will download and install Gensim and its dependencies from the Python Package Index (PyPI).
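You can confirm that the installation succeeded by printing the installed version:

python -c "import gensim; print(gensim.__version__)"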

Implementation of FastText using Gensim

Let's see how we can train a FastText model. To illustrate the process, we'll use a very small dataset of a few sentences. To ensure that the model is well trained and generates sufficiently good results, it is recommended that you use a larger dataset.

from gensim.models import FastText
from gensim.test.utils import common_texts

for i in range(len(common_texts)):
    print(common_texts[i])

model = FastText(vector_size=4, window=3, min_count=1)
model.build_vocab(corpus_iterable=common_texts)
model.train(corpus_iterable=common_texts, total_examples=len(common_texts), epochs=10) # train

Code explanation:

  • Line 1: We import the FastText model class from the Gensim library, which is used for training word embeddings.

  • Line 2: We import a set of common example sentences provided by Gensim. These sentences will be used as training data for the model.

  • Lines 4–5: We print each sentence in the common_texts dataset. This is just for demonstration purposes, to show the example sentences.

  • Line 7: We instantiate a FastText model with specific hyperparameters: vector size of 4, window size of 3, and minimum count of 1. These can be changed according to specific requirements.

  • Line 8: We build the vocabulary of the model based on the provided common_texts corpus.

  • Line 9: We train the FastText model on the common_texts corpus. It iterates over the corpus 10 times (epochs=10), updating the model parameters based on the training data.
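As a shortcut, Gensim also lets you do all of the above in a single call: if you pass the corpus to the FastText constructor, it builds the vocabulary and trains the model internally. A minimal sketch equivalent to the code above:

from gensim.models import FastText
from gensim.test.utils import common_texts

# Passing the corpus to the constructor builds the vocabulary and trains in one step
model = FastText(common_texts, vector_size=4, window=3, min_count=1, epochs=10)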

Now let's look at some uses of the FastText model.

word_embedding = model.wv['computer']
print(word_embedding)

Code explanation:

  • In the code above, we retrieve and print the word embedding vector for the word 'computer' as learned by the trained FastText model. This vector represents the word's semantic meaning as a point in the model's vector space (here four-dimensional, since we set vector_size=4).

One of the advantages of the FastText model is its ability to generate embeddings even for words that are not present in its vocabulary. Try replacing the word 'computer' with some random word that is not present in the common_texts dataset.
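For example, the word 'computation' (a hypothetical choice; any unseen word works) never appears in common_texts, yet the model still assembles a vector for it from its character n-grams:

oov_word = 'computation'  # not in the training vocabulary
print(oov_word in model.wv.key_to_index)  # False: the word is out of vocabulary
print(model.wv[oov_word])  # a vector is still built from the word's character n-grams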

similarity = model.wv.similarity('computer', 'human')
print(similarity)

In the code above, we compute the cosine similarity between the word vectors of "computer" and "human". Cosine similarity measures the cosine of the angle between two vectors in a multi-dimensional space. Higher values indicate greater similarity.
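If you want to sanity-check this number, the same value can be computed by hand from the raw vectors; a quick sketch using NumPy:

import numpy as np

v1 = model.wv['computer']
v2 = model.wv['human']
# Cosine similarity: dot product divided by the product of the vector norms
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))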

similar_words = model.wv.most_similar('system', topn=3)
print(similar_words)

The code above is similar to the one in the previous example, except here we are looking for the top three words that are most similar to the word "system" in terms of their word embeddings. The similarity is also determined using cosine similarity between word vectors.

Benefits of the FastText model

Here are some of the most important benefits of the FastText model:

  • Handling out-of-vocabulary words: FastText can generate embeddings for out-of-vocabulary words by breaking them down into character n-grams (illustrated in the sketch at the end of this section). This feature enables the model to handle rare and unseen words effectively, making it robust in real-world applications where vocabulary coverage is crucial.

  • Efficient training: FastText is computationally efficient, especially compared to more complex models like deep neural networks. It can be trained on large datasets relatively quickly, making it suitable for scenarios where training time and computational resources are limited.

  • Domain adaptability: FastText can be trained on specific domains or specialized corpora, enabling it to capture domain-specific semantics effectively. This flexibility makes FastText adaptable to a wide range of applications and domains, from general-purpose NLP tasks to domain-specific applications such as biomedical text mining or legal document analysis.

  • Multi-lingual support: FastText supports multiple languages and can generate embeddings for different languages using the same architecture. This feature makes FastText suitable for multilingual applications and enables transfer learning across languages, allowing models trained on one language to benefit from data in other languages.

These benefits collectively make FastText a versatile and powerful tool for various natural language processing tasks, offering efficiency, robustness, and effectiveness in capturing semantic relationships between words.
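To make the out-of-vocabulary mechanism concrete, the sketch below shows how a word is broken into character n-grams. By default, Gensim's FastText wraps the word in the boundary markers < and > and extracts n-grams of length 3 to 6 (the min_n and max_n parameters); the word's vector is then derived from the vectors of these pieces, which is why even an unseen word maps to a meaningful point in the space:

def char_ngrams(word, min_n=3, max_n=6):
    """Decompose a word into the character n-grams FastText uses to build its vector."""
    wrapped = f'<{word}>'  # boundary markers distinguish prefixes from suffixes
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]

print(char_ngrams('where'))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>', ...]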

Conclusion

FastText is a powerful and efficient tool for text representation and classification, developed by Facebook AI Research. Its lightweight nature allows it to run on standard hardware, including smartphones, making it a versatile choice for various applications. Gensim, on the other hand, is a robust open-source library that offers a suite of algorithms for text analysis, including FastText, and is designed for practical, large-scale data processing. Both tools are invaluable in the realm of natural language processing, providing efficient solutions for text analysis and representation.
