Natural Language Processing, commonly known as NLP, is a rapidly evolving field. This branch of machine learning uses software to manipulate and create natural language such as speech and text.
With the advent of AI assistants like Siri, Cortana, Alexa, and Google Assistant, the use of NLP has increased manyfold. People are trying to build models that can better understand human languages, formally known as natural languages.
The most common uses of Natural Language Processing in our daily life are search engines, machine translation, chatbots, and home assistants.
Today, we’ll go over the basics of NLP using Python and discuss some of the main trends in the industry.
We will cover:
Natural language processing (NLP) is one of the most important applications of machine learning in industry today. NLP deals with anything related to using machines to process and understand human text and speech, which we call natural languages.
Tasks such as translating between languages, speech recognition, text analysis, and automatic text generation all fall under the scope of NLP. Let’s define the two terms Natural Language and Natural Language Processing in a more formal way.
Natural Language: A language that has developed naturally in humans.
Natural Language Processing: The ability of a computer program to understand human languages as they are spoken and written. The ultimate objective of NLP is to read, decipher, understand, and make sense of human languages in a valuable way.
NLP deals with two categories of data: spoken and written. Written data, like text, is more prevalent in NLP tasks, but raw text data is usually unusable in NLP applications. An engineer must first convert the raw text data into usable machine data. That machine data is then fed as input to an NLP algorithm.
NLP basically includes two important parts: Natural Language Understanding and Natural Language Generation.
Natural Language Understanding means that a machine learning or deep learning model is able to understand the language spoken by humans. In other words, the system is able to comprehend the sentences spoken or written by us. If a system is able to understand natural language, then it is able to reply to our questions.
It can be used to solve many real-world problems like Question-Answer, Query resolution, Sentiment Analysis, Similarity detection in texts, and Chatbots.
Natural Language Generation, on the other hand, is the ability of a machine learning model to generate output in the form of text or audio that is similar to human language. In this task, we generate sentences from predefined text datasets using the model.
It is used for summarization of text, replying to queries or questions, machine translation, and generation of answers.
In the past few years, many advances have been made in the field of NLP. This has been possible due to increased resources in the form of large text datasets, cloud platforms for training large models, and so on. But the most important factors are the introduction of transformers and the use of transfer learning.
Now, models are pre-trained on large datasets. The parameters of the pre-trained model are then adjusted, or fine-tuned, for the required task, such as text classification, part-of-speech tagging, named entity recognition, text summarization, or question answering.
NLP deals with applying algorithms that extract the rules of a natural language and convert them into a form a computer can understand. We first provide the text, and a computer uses algorithms to extract meaning.
Many different techniques are used for this process, including:
When it comes to written data, we use a text corpus and tokenization. A text corpus is the collection of text we work with, and the distinct tokens it contains form our vocabulary. Vocabularies can be character-based or word-based; word-based vocabularies are more popular.
Then, we need to analyze how many times a word appears in a corpus. We do this by representing the text data as a vector of words. This process is called tokenization.
We use a tokenizer object to convert a text corpus into sequences. This is done with the ML tool TensorFlow. The tokenizer essentially converts each vocabulary word to an integer ID, assigned in order of descending frequency.
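The frequency-based ID assignment can be sketched in plain Python. This is a minimal illustration of the idea, not TensorFlow's tokenizer itself: the helper names `build_word_index` and `texts_to_sequences` are hypothetical, tokenization is a simple lowercase whitespace split, and IDs start at 1 on the assumption that 0 is reserved for padding.

```python
from collections import Counter


def build_word_index(corpus):
    """Map each word to an integer ID, most frequent word first (IDs start at 1)."""
    counts = Counter(word for text in corpus for word in text.lower().split())
    # Sort by descending frequency; break ties alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: idx for idx, (word, _) in enumerate(ranked, start=1)}


def texts_to_sequences(corpus, word_index):
    """Replace each known word with its ID; words missing from the vocabulary are skipped."""
    return [
        [word_index[w] for w in text.lower().split() if w in word_index]
        for text in corpus
    ]


# A tiny made-up corpus for illustration.
corpus = ["the cat sat", "the dog sat down", "the cat ran"]
word_index = build_word_index(corpus)
sequences = texts_to_sequences(corpus, word_index)
print(word_index)
print(sequences)
```

Here "the" appears most often, so it receives the smallest ID, and each sentence becomes a list of integers that a model can consume.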
Now that we’re familiar with the basics of NLP, let’s dive into the top 13 advances that we’ve seen in the field. This should help to familiarize you with NLP and show you what this amazing technology can do.
“Attention Is All You Need” is a research paper by Ashish Vaswani et al., published by Google researchers in June 2017, that revolutionized the NLP industry. It was the first time the concept of transformers was introduced.
Before this paper, RNN and CNN were used in the field of NLP but they had two problems:
RNNs could not handle long-term dependencies well, even with improvements such as bidirectional RNNs, LSTMs, and GRUs. Transformers with self-attention solved these problems and made a breakthrough in NLP. The transformer was state-of-the-art for seq2seq models, which are used for language translation.
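To make the idea of self-attention more concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It rests on simplifying assumptions: the embeddings are made-up numbers, and queries, keys, and values are the embeddings themselves, whereas a real transformer first applies learned projection matrices and uses several attention heads. The function names are illustrative only.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compare this position's query with the key of every position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is a weighted average of all value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy 2-dimensional embeddings for a three-token sentence (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each position attends to every other position in a single step, no matter how far apart they are, which is why long-range dependencies are easier to capture than in an RNN that passes information along one token at a time.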
The other most important development was the use of transfer learning in the field of NLP. ULMFiT (Universal Language Model Fine-tuning), a language model created by fast.ai in May 2018, introduced the concept of transfer learning to the NLP community.
It is a single universal language model fine-tuned for multiple tasks. The same model can be fine-tuned to solve 3 different NLP tasks. The building block of this model is AWD-LSTM, which stands for ASGD Weight-Dropped LSTM, where ASGD is averaged stochastic gradient descent.
BERT, also created by Google AI teams, was released in November 2018. It combines both of the advances mentioned above for the bidirectional training of transformers. It achieved state-of-the-art results on 11 NLP tasks. It is pre-trained on the whole English Wikipedia dataset, which consists of almost 2.5 billion words.
Transformer-XL, also from Google AI (January 2019), outperformed even BERT in language modeling. It also resolved the issue of context fragmentation faced by the original transformers.
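Context fragmentation can be illustrated with a toy sketch. When a long text is cut into fixed-length segments that are processed independently, the first tokens of each segment have no access to what came before. Transformer-XL addresses this with segment-level recurrence, reusing cached information from the previous segment. The code below is only an illustration under simplifying assumptions: the helper names are hypothetical, and it works on raw tokens, whereas the real model caches hidden states.

```python
def split_into_segments(tokens, segment_len):
    """Cut a token list into consecutive fixed-length segments."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def visible_context(segments, memory_len=0):
    """For each segment, list the tokens a model can see while processing it."""
    contexts = []
    for idx, segment in enumerate(segments):
        # Flatten everything that came before this segment.
        previous = [tok for seg in segments[:idx] for tok in seg]
        # Keep only the most recent `memory_len` tokens as memory.
        memory = previous[-memory_len:] if memory_len > 0 else []
        contexts.append(memory + segment)
    return contexts


tokens = "the quick brown fox jumps over the lazy dog".split()
segments = split_into_segments(tokens, segment_len=3)

fragmented = visible_context(segments)                 # no memory: context is cut at segment borders
with_memory = visible_context(segments, memory_len=3)  # previous segment is carried over
print(fragmented[1])
print(with_memory[1])
```

Without memory, the second segment starts at "fox" with no knowledge of the words before it. With memory, the same segment can also look back at the previous one.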
According to its official site, StanfordNLP is a Python natural language analysis package. It contains tools, which can be used in a pipeline, to convert a string containing human language text into lists of sentences and words, to generate base forms of those words, their parts of speech and morphological features, and to give a syntactic structure dependency parse.
It contains pre-trained neural models for 53 human languages, thus extending the scope of NLP to a global level instead of being restricted to just English.
GPT-2, created by OpenAI in February 2019, stands for “Generative Pre-trained Transformer 2”. As the name suggests, it is used for tasks concerned with the natural language generation part of NLP. This is the SOTA (state-of-the-art) model for text generation.
GPT-2 can generate a whole article from a short input prompt. It is also based on transformers. GPT-2 achieves state-of-the-art scores on a variety of domain-specific language modeling tasks. It is not trained on any data specific to these tasks and is only evaluated on them as a final test. This is known as the “zero-shot” setting.
Created by CMU AI in June 2019, XLNet uses auto-regressive methods for language modeling instead of the auto-encoding approach used in BERT. It combines the best features of both BERT and Transformer-XL.
In July 2019, Hugging Face released PyTorch Transformers. With this tool, we can use BERT, XLNet, and Transformer-XL models with very few lines of Python code.
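The difference between the two training objectives can be shown with a small sketch. An auto-regressive model predicts each token from the tokens to its left, while an auto-encoding model such as BERT hides some tokens and predicts them from the context on both sides. This is only an illustration of the two general objectives: the function names are hypothetical, the masked positions are chosen by hand, and XLNet's actual objective additionally permutes the order in which tokens are predicted.

```python
def autoregressive_examples(tokens):
    """Each token is predicted from the tokens to its left only."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_examples(tokens, mask_positions, mask_token="[MASK]"):
    """Hidden tokens are predicted from the context on both sides."""
    corrupted = [mask_token if i in mask_positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(mask_positions)}
    return corrupted, targets


sentence = ["new", "york", "is", "a", "city"]

ar = autoregressive_examples(sentence)
for context, target in ar:
    print(context, "->", target)

corrupted, targets = masked_examples(sentence, {1, 3})
print(corrupted, "->", targets)
```

In the auto-regressive case the model never sees words to the right of its target. In the masked case it sees both sides, but the `[MASK]` placeholder appears only during pre-training and not in real input.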
In July 2019, the Chinese search giant Baidu released ERNIE 2.0, a model featuring continual pre-training. It is a pre-trained language understanding model that achieved state-of-the-art results and outperformed BERT and the then-recent XLNet in 16 NLP tasks in both Chinese and English.
In July 2019, Facebook AI released RoBERTa, an improvement over BERT. The development team at Facebook AI optimized BERT’s training process and hyperparameters to produce this model.
In August 2019, Explosion, the company behind spaCy, released spaCy-PyTorch Transformers, a library that wraps Hugging Face’s PyTorch Transformers so that transformer models can be used inside spaCy pipelines. It can also be used for the deployment of transformers.
In August 2019, Facebook released XLM, a multilingual language model covering almost 100 languages. It is SOTA for cross-lingual classification and machine translation.
In April 2020, Stanford University released Stanza, an advanced version of StanfordNLP that supports 66 languages. Stanza features a language-agnostic, fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition.
Natural language processing is an exciting field, and it is continuing to grow. In the near future, every product will have ML components, so this is a lucrative career for anyone interested in ML.
If you want to continue learning about NLP, I recommend tackling the following concepts next:
To get started, check out Educative’s course Natural Language Processing with Machine Learning, which covers all these topics and more in Python code. After completing this course, you will be able to solve the important day-to-day NLP problems faced in industry.