Introducing Tokenization
Explore tokenization in spaCy by understanding how text is split into tokens such as words, punctuation, and symbols. Learn the importance of language-specific rules behind tokenization and how spaCy creates Doc and Token objects. This lesson prepares you to handle text accurately in NLP tasks using spaCy.
Tokenization is the first step in a text processing pipeline: every downstream operation works on tokens, so it must always come first.
Tokenization means splitting a piece of text into its tokens. A token is a unit of semantics; you can think of it as the smallest meaningful part of a text. Tokens can be words, numbers, punctuation marks, currency symbols, and any other meaningful symbols that serve as the building blocks of a sentence. The following are examples of tokens:
`USA` | `NY` | `city` | `33` | `3rd` | `!` | `...?` | `'s`
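To see tokens like these in practice, the sketch below runs spaCy's English tokenizer on a sentence containing a contraction, an ordinal, and trailing punctuation. It uses `spacy.blank("en")` (a tokenizer-only pipeline) so that no trained model download is required; the sample sentence is our own invention:

```python
import spacy

# Blank English pipeline: just the rule-based tokenizer, no trained model.
nlp = spacy.blank("en")

doc = nlp("He's flying to NY on the 3rd!")
print([token.text for token in doc])
# → ['He', "'s", 'flying', 'to', 'NY', 'on', 'the', '3rd', '!']
```

Note how the contraction `He's` is split into `He` and `'s`, and the exclamation mark becomes its own token, while `3rd` and `NY` each remain a single token.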
Tokenization in spaCy
The input to the spaCy tokenizer is Unicode text, and the result is a Doc object. The following code shows the tokenization process:
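A minimal sketch of that process is shown below. It assumes a blank English pipeline created with `spacy.blank("en")`, which includes only the tokenizer and so avoids downloading a trained model such as `en_core_web_sm`; the sample sentence is illustrative:

```python
import spacy

# Build an English pipeline containing only the tokenizer.
nlp = spacy.blank("en")

# Calling the pipeline on Unicode text returns a Doc object.
doc = nlp("I own a ginger cat.")

# A Doc behaves as a sequence of Token objects.
for token in doc:
    print(token.text)
# → I, own, a, ginger, cat, .  (one token per line)
```

Iterating over the `Doc` yields `Token` objects, and `token.text` gives each token's surface form; the final period is split off as its own token.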