NLP and Python

Let's get a brief overview of what NLP is and why we use Python for it.

Rise of NLP

Over the past few years, many branches of AI have created a lot of buzz, including natural language processing (NLP), computer vision, and predictive analytics. But what exactly is NLP? How can a machine or a piece of code make sense of human language?

NLP is a subfield of AI that analyzes text, speech, and other forms of human-generated language data. Human language is complicated: even a short paragraph contains references to previous words, pointers to real-world objects, cultural references, and the writer's or speaker's personal experiences. The figure below shows an example sentence. It includes a relative date (recently), a phrase that can be resolved only by someone who knows the speaker (the city where the speaker's parents live), and an idea that requires general knowledge about the world (a city is a place where human beings live together):

An example of human language, containing many cognitive and cultural aspects

How do we process such a complicated structure, then? We have our weapons too: we model natural language with statistical models, and we process linguistic features to turn text into a well-structured representation. This course provides the background and tools learners need to extract meaning from text. By the end of this course, you will have the statistical and linguistic knowledge to process text with a great tool, the spaCy library.

Though NLP has gained popularity only recently, human language processing has long been part of our lives through many real-world applications, including search engines, translation services, and recommendation engines.

Search engines such as Google Search, Yahoo Search, and Microsoft Bing are integral to our daily lives. We look for homework help, cooking recipes, information about celebrities, the latest episodes of our favorite TV series, and much more. There is even a verb in English (and in many other languages), to google, meaning to look something up on the Google search engine.

Search engines use advanced NLP techniques, including mapping queries into a semantic space, where similar queries are represented by similar vectors. One familiar feature is autocomplete, where query suggestions appear in the search bar as soon as we type the first few letters. Autocomplete looks tricky, but the underlying algorithm is a combination of a search tree walk and a character-level distance calculation. A past query is represented as a sequence of its characters, where each character corresponds to a node in the search tree. The arcs between the characters are assigned weights according to the popularity of that past query.

Then, when a new query comes in, we compare the current query string to past queries by walking the tree. A fundamental computer science (CS) data structure, the tree, is used to represent a list of queries; who would have thought that? The figure below shows a walk on the character tree:

This is a simplified explanation; the real algorithms usually blend several techniques and are far more complex.
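The tree walk described above can be sketched with a character tree (a trie). This is only a minimal illustration under several assumptions: the class names `TrieNode` and `QueryTrie`, the `add` and `suggest` methods, and the example queries and counts are all hypothetical. Popularity is stored at the node where a past query ends rather than on the arcs, and the character-level distance calculation is left out entirely.

```python
class TrieNode:
    """One character position in the tree of past queries."""

    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.count = 0      # popularity of the past query ending here


class QueryTrie:
    """Hypothetical store of past queries, indexed character by character."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, query, count=1):
        # Walk down the tree, creating nodes as needed, one per character.
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
        node.count += count

    def suggest(self, prefix, limit=3):
        # Step 1: walk the tree along the characters typed so far.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []  # no past query starts with this prefix

        # Step 2: collect every past query stored below that node.
        results = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.count > 0:
                results.append((text, current.count))
            for ch, child in current.children.items():
                stack.append((child, text + ch))

        # Step 3: most popular first, ties broken alphabetically.
        results.sort(key=lambda item: (-item[1], item[0]))
        return [text for text, _ in results[:limit]]


# Made-up past queries and popularity counts, for illustration only.
trie = QueryTrie()
trie.add("python tutorial", 50)
trie.add("python nlp", 30)
trie.add("pyramid", 10)
trie.add("spacy", 20)

print(trie.suggest("py"))
# ['python tutorial', 'python nlp', 'pyramid']
```

Typing more characters narrows the walk to a smaller subtree, which is why the suggestions can be updated after every keystroke.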

Search engines also know how to transform unstructured data into structured and linked data. When we type Diana Spencer (at the time of the creation of this course) into the search bar, this is what comes up:

How did the search engine link Diana Spencer to her well-known name, Princess Diana? This is called entity linking: we link mentions that refer to the same real-world entity. Entity-linking algorithms are concerned with representing semantic relations and knowledge in general. This area of NLP is closely tied to the Semantic Web.
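As a toy illustration of the idea, the simplest possible entity linker is a lookup table that maps different surface forms to one canonical identifier. Everything in this sketch is an assumption made for illustration: the `ALIASES` table, the identifier string, and the `link_mention` function are hypothetical, and real entity-linking systems rely on large knowledge bases and context rather than a hand-written dictionary.

```python
# Hypothetical alias table: normalized surface form -> canonical entity ID.
ALIASES = {
    "diana spencer": "entity:diana_princess_of_wales",
    "princess diana": "entity:diana_princess_of_wales",
}


def link_mention(mention):
    """Return the canonical entity ID for a mention, or None if unknown."""
    # Normalize case and collapse repeated whitespace before the lookup.
    key = " ".join(mention.lower().split())
    return ALIASES.get(key)


# Both names resolve to the same real-world entity.
print(link_mention("Diana Spencer") == link_mention("Princess Diana"))  # True
print(link_mention("Some Unknown Name"))  # None
```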

There is really no limit to what we can develop: search engine algorithms, chatbots, speech recognition applications, and user sentiment recognition applications. NLP problems are challenging yet fascinating. This course's mission is to provide learners with a toolbox containing all the necessary tools. The first step of NLP development is to choose the programming language wisely. In this course, our weapon of choice is Python. Let’s see the strong bond between NLP and Python.

NLP with Python

As we remarked before, NLP is a subfield of AI that analyzes text, speech, and other forms of human-generated language data. For industry professionals, the first choice for manipulating text data is Python. In general, there are many benefits to using Python:

  • It is easy to read and looks very similar to pseudocode.

  • It makes code easy to write and test.

  • It has a high level of abstraction.

Python is a great choice for developing NLP systems because of the following:

  • Simplicity: Python is easy to learn. You can focus on NLP rather than the programming language details.

  • Efficiency: Python lets you build NLP application prototypes quickly.

  • Popularity: Python is one of the most popular languages. It has huge community support, and installing new libraries with pip is effortless.

  • AI ecosystem presence: A significant number of open-source NLP libraries are available in Python. Many machine learning (ML) libraries and frameworks, such as PyTorch, TensorFlow, and Apache Spark, also provide Python APIs.

  • Text methods: String and file operations in Python are effortless and straightforward. For example, splitting a sentence at whitespace takes a single call, sentence.split(). The same task can be quite painful in other languages, such as C++, where you have to deal with stream objects.
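The one-liner from the last bullet looks like this in practice. The example sentence is made up for illustration.

```python
sentence = "My parents   moved to a new city recently."

# str.split() with no arguments splits on any run of whitespace,
# so the repeated spaces above do not produce empty strings.
tokens = sentence.split()
print(tokens)
# ['My', 'parents', 'moved', 'to', 'a', 'new', 'city', 'recently.']
```

Note that the final period stays attached to the last word: whitespace splitting is convenient, but it does not separate punctuation from words.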

When we put all the preceding points together, the following picture emerges. Python sits at the intersection of string processing, the AI ecosystem, and ML libraries, which gives us the best NLP development experience:

NLP with Python overview

Note: We will use Python 3.5+ throughout this course.

In Python 3.x, strings are Unicode by default, and source files are read as UTF-8 unless declared otherwise, which means that we can use Unicode text without worrying much about the encoding. We won't go into the details of encodings here, but you can think of Unicode as a superset of ASCII that includes many more characters, such as the umlauts of the German alphabet and the accented characters of the French alphabet. This way, we can process German, French, and many other languages besides English.
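A short sketch of what this means in practice: accented characters behave like any other character in a Python 3 string, and an encoding only comes into play when text is converted to or from bytes. The example words are chosen purely for illustration.

```python
# Python 3 strings are sequences of Unicode code points, not bytes.
words = ["Mädchen", "garçon", "naïve"]

for word in words:
    encoded = word.encode("utf-8")     # str -> bytes
    decoded = encoded.decode("utf-8")  # bytes -> str, round trip
    # Each accented letter needs more than one byte in UTF-8,
    # so the byte length is larger than the character count.
    print(word, len(word), len(encoded), decoded == word)

# String methods work on non-ASCII characters as well.
print("mädchen".upper())
```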