
Text and Token Classification

Explore how text and token classification work in natural language processing using Hugging Face transformers. Understand tokenization, embeddings, logits, and model outputs to apply classification for tasks like intent detection, NER, and zero-shot classification with hands-on Python examples.

Classification is one of the central pillars of natural language processing.

Whenever an ML model attempts to determine the category, meaning, tone, or role of something, it is performing classification. This seemingly simple operation powers most real-world NLP applications, including filtering spam, detecting toxic content, routing customer support tickets, analyzing sentiment in user reviews, and even determining whether a sentence contradicts or supports another.

Broadly, classification appears in two forms: text classification (labeling the entire passage) and token classification (labeling individual words or subwords). While the two may seem similar, they address very different problems and require distinct modeling strategies.

Fun fact: The earliest classification systems in NLP in the 1990s were rule-based and brittle. Today’s transformer models outperform them by enormous margins without needing handcrafted rules.

How classification models actually work

Most tutorials demonstrate pipelines without explaining how the model arrives at its prediction. Understanding the core components (tokenization, transformer layers, logits, and softmax) will dramatically improve your ability to debug, interpret, and optimize your models.

Tokenization

When text enters a model, it is first split into tokens using a tokenizer such as WordPiece, BPE, or SentencePiece.

Tokenization involves more than splitting text by spaces or punctuation. Modern tokenizers divide words into smaller subword units, which helps models understand rare or unfamiliar words without increasing the vocabulary size.

For example, “unbelievable” might be tokenized to:

"unbelievable" → ["un", "##believable"]
Subword tokenization example

Subword tokenization offers a robust approach to handling misspellings, compound words, and morphological variants, while maintaining a manageable vocabulary size.
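
You can inspect this behavior directly with a Hugging Face tokenizer. The sketch below is a minimal example assuming the bert-base-cased checkpoint; the exact subword split depends on each model's vocabulary, so your output may differ from the illustration above.

from transformers import AutoTokenizer

# Load a WordPiece tokenizer (bert-base-cased is an assumption; any checkpoint works)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

print(tokenizer.tokenize("unbelievable"))   # subword pieces, e.g. ['un', '##believable']
print(tokenizer.tokenize("tokenization"))   # rarer words are split into several pieces
Inspecting subword tokenization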

Subword splitting also affects how you interpret token-level outputs: an entity may appear across multiple sub-tokens and must be aggregated back into a single span. Tokenization intersects with linguistic preprocessing techniques such as lemmatization and stemming.

The former reduces words to their base or dictionary form, while the latter chops off word endings. For example, "running" and "better" could be stemmed and lemmatized, respectively, as follows.

"running" → "run"
"better" → "good"
Stemming and lemmatization example
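
If you want to reproduce these normalizations in code, the sketch below uses NLTK (an assumption; this library is not part of the lesson's Transformers examples) and requires the WordNet data download shown in the comments.

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time download of the WordNet lemma data

print(PorterStemmer().stem("running"))                    # 'run'  (stemming chops the suffix)
print(WordNetLemmatizer().lemmatize("better", pos="a"))   # 'good' (lemmatization maps to the adjective's dictionary form)
Stemming and lemmatization with NLTK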
Quiz: Why do tokenizers break words into subwords?


Embedding + Transformer layers

After tokenization, each token is converted into a dense vector (an embedding).

These vectors pass through stacked transformer layers where self-attention computes how each token should attend to every other token in the input. This is what gives transformers their power: they create contextualized token representations, meaning the embedding for a token depends on the entire sentence.

Due to their self-attention mechanism, transformers can handle long-range dependencies, resolve ambiguities, and detect constructs like negation and sarcasm that rely on distant words. Practically, this is why a model can understand that “not bad” implies a positive sentiment even though the word “bad” is present.
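
A small experiment makes the idea of contextualized representations concrete. The sketch below assumes the bert-base-uncased checkpoint and shows that the vector for the token "bank" changes with its sentence.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["She sat by the river bank.", "He deposited cash at the bank."]
bank_id = tokenizer.convert_tokens_to_ids("bank")

with torch.no_grad():
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt")
        position = inputs["input_ids"][0].tolist().index(bank_id)   # locate the "bank" token
        vector = model(**inputs).last_hidden_state[0, position]     # its contextual embedding
        print(sentence, "->", vector[:3])                           # first few dimensions differ per sentence
Contextualized embeddings for the same token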

Fun fact: Self-attention was originally inspired by memory networks and alignment models in translation.

Logits

At the top of the network, the model outputs logits, which are raw numerical scores for each possible label.

Logits are not probabilities; they are unbounded numbers that the model uses internally to rank classes. The relative differences between logits are what matter. If you’re building thresholds, calibrating models, or diagnosing why a model is uncertain, examining logits is often more informative than simply examining the final label.
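
To see raw logits, you can call a sequence-classification model directly instead of going through a pipeline. This minimal sketch assumes the distilbert-base-uncased-finetuned-sst-2-english checkpoint recommended later in this lesson.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

inputs = tokenizer("The movie was not bad at all.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits   # raw, unbounded scores with shape (batch_size, num_labels)
print("Raw logits:", logits)
Inspecting raw logits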

Softmax + Label mapping

Logits are converted to a pseudo-probability distribution using the softmax function. The class with the highest resulting probability becomes the predicted label, and Hugging Face returns a compact dictionary such as:

{'label': 'POSITIVE', 'score': 0.9987}

Knowing this final step makes it easier to reason about confidence: a high score near 1.0 suggests strong model agreement, while values closer to 0.5 indicate uncertainty and a need for caution (e.g., human review).
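
Continuing the logits sketch from the previous subsection (same model and logits variables), the final step applies softmax and maps the winning index back to a label name through the model's id2label configuration.

probabilities = torch.softmax(logits, dim=-1)       # normalize logits into a distribution
predicted_id = int(probabilities.argmax(dim=-1))    # index of the highest-probability class
label = model.config.id2label[predicted_id]         # e.g. {0: 'NEGATIVE', 1: 'POSITIVE'}
print({"label": label, "score": round(float(probabilities[0, predicted_id]), 4)})
Softmax and label mapping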

Fun fact: The softmax function originates from statistical mechanics and models the distribution of energy across states.

Quiz: Why should we care about logits if we only need labels?


Text classification

Text classification assigns one (or sometimes several) labels to an entire passage. This is a high-level operation: the model uses context across the whole input to make a single decision that summarizes intent, sentiment, topic, or action. Because it treats the input holistically, text classification is most effective when the label depends on the full text rather than a specific word or phrase.

Examples and explanation

  • Intent detection: Chatbots use text classification to map user utterances to intents like reset_password or check_balance. Intent models must be sensitive to short conversational phrases and be robust to paraphrase.

  • Topic classification: Newsrooms and aggregators classify long documents to route content to vertical teams. Topic models may require domain-specific vocabularies and sometimes hierarchical labels.

  • Toxicity detection: Moderation systems use classifiers to flag abusive language; these systems are often multi-label because a single text can be insulting, hateful, and sexually explicit at the same time.

Note about models: Always pick a model aligned with your task. For example, a sentiment model fine-tuned on movie reviews may underperform on social media text; a model trained for toxicity on social networks will better capture abusive slang and emoji usage.
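
As a concrete starting point, the sketch below runs a whole-passage sentiment classifier; it assumes the distilbert-base-uncased-finetuned-sst-2-english checkpoint listed in the model table later in this lesson.

from transformers import pipeline

sentiment_classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
result = sentiment_classifier("The support team resolved my issue within minutes.")
print("Sentiment:", result)   # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
Running a text classification pipeline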

Quiz: Why do toxicity models return multiple labels instead of one?


Natural language inference (NLI)

NLI is the task of deciding whether a hypothesis is entailed by, contradicted by, or neutral with respect to a given premise.

It is a reasoning-style classification rather than a surface-level label. NLI systems are extremely useful in fact-checking, determining whether an answer is supported by a source text, and improving search ranking by checking semantic fit.

Where NLI shines is verification: rather than generating an answer, an NLI model judges the relationship between two pieces of text. Because of this, models trained on MNLI (the "M" stands for Multi-Genre), such as roberta-large-mnli, are effective building blocks for zero-shot classification as well.

from transformers import pipeline

nli_pipeline = pipeline('text-classification', model='roberta-large-mnli')
premise = "The man is playing a guitar."
hypothesis = "The man is making music."
# MNLI-style models expect the premise first, followed by the hypothesis
nli_input = f"{premise} </s></s> {hypothesis}"
result = nli_pipeline(nli_input)
print("NLI Result:", result)
Initializing an NLI pipeline

Explanation: The pipeline combines the premise and hypothesis into a single input, feeds it to the RoBERTa MNLI model, and outputs whether the hypothesis is entailed by, contradicted by, or neutral with respect to the premise.

The result shows the predicted label and confidence score.

Fun fact: NLI models are a hidden backbone in many search and retrieval systems. They help filter which documents truly support a query, rather than just containing similar keywords.

Zero-shot classification

Zero-shot classification uses models trained on NLI-style tasks to score candidate labels without any task-specific training.

You provide human-readable category names and the model ranks them by semantic fit. This is transformative for rapid prototyping and production environments where labeled data is expensive or slow to produce. The trick is that NLI-trained models can compare a label description to the input text in a semantically meaningful way.

When to use zero-shot: When you want flexible categories, need to prototype quickly, or expect labels to change frequently. It is not always as accurate as a model fine-tuned for a specific domain, so consider validation on a small sample if this will drive production decisions.

zero_shot_classifier = pipeline('zero-shot-classification', model='facebook/bart-large-mnli')
candidate_labels = ["finance", "sports", "politics", "technology"]
result = zero_shot_classifier("Tesla plans to open a new factory in Mexico.", candidate_labels)
print("Zero-Shot Classification Result:", result)
Initializing a zero-shot classifier

Explanation: Zero-shot classification assigns a text to one or more labels without needing task-specific training. The input sentence is compared against the candidate_labels, and the model predicts which label(s) best describe the text, returning a score for each.

Quiz: Why is zero-shot classification a breakthrough?


Token classification

Token classification assigns labels to each token (word or subword). This is a local task: instead of summarizing the whole input, token classifiers identify and label important spans inside the input. Token classification is the right choice when you need structured outputs such as names, dates, monetary amounts, and other discrete pieces of information.

Named entity recognition (NER)

NER is the prototypical token classification task.

Modern NER models identify spans representing people, organizations, locations, dates, and more, and return character offsets, allowing you to extract exact text spans. Aggregation strategies (e.g., "simple") merge subword predictions into whole-word spans so the output is human-friendly. The "simple" aggregation strategy combines consecutive tokens with the same entity label into a single entity.

For instance, "John Doe works in New York" would combine "John" and "Doe" into a single entity "John Doe" (Person).

ner_pipeline = pipeline('token-classification', model='dslim/bert-base-NER', aggregation_strategy='simple')
ner_text = "Barack Obama visited Paris during the G20 summit."
entities = ner_pipeline(ner_text)
print("NER Entities:", entities)
NER pipeline with aggregation

Explanation: This pipeline identifies and extracts entities from text.

The model labels tokens (like names, locations, and organizations), and aggregation_strategy='simple' merges consecutive tokens of the same entity type, returning a clean list of recognized entities with their labels. Typical use cases include:

  • Extracting patient names and medications from medical records

  • Retrieving invoice amounts and vendor names for accounting automation

  • Redacting personal data from documents to comply with privacy regulations

Note: The first NER shared tasks in the early 2000s had accuracies in the 60–70% range; transformer-based models now routinely exceed 90%.

Part-of-Speech (PoS) tagging

PoS tagging labels grammatical roles. Although it is a classic NLP application, PoS tags remain valuable in pipelines that need linguistic structure, such as grammar checking, rule-based extraction, or downstream syntactic analysis. (Downstream systems are applications or tasks that use the output of an earlier natural language processing model as their input.)

Transformers outperform older statistical methods in this task due to their better context modeling capabilities.
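
PoS tagging uses the same token-classification pipeline as NER, just with a PoS-fine-tuned checkpoint. The model name below is an assumption for illustration; substitute any PoS model from the Hub.

from transformers import pipeline

pos_pipeline = pipeline(
    "token-classification",
    model="vblagoje/bert-english-uncased-finetuned-pos",   # assumed PoS checkpoint; replace as needed
    aggregation_strategy="simple",
)
print(pos_pipeline("Transformers handle long-range dependencies well."))
PoS tagging with the token-classification pipeline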

Quiz: Is PoS tagging still useful with deep learning?


Choosing between text and token classification

Choosing the correct classification level depends on the information you need.

If your goal is to route, label, or summarize documents, text classification is the right abstraction. If you need to extract structured fields or redact personal data, token classification is required. Often, the best systems are hybrid: a text classifier first determines the document type, and then token-level models extract the fields that matter for that type.

| Feature | Text Classification | Token Classification |
| --- | --- | --- |
| Label target | Whole sentence/document | Individual words/subwords |
| Best for | Sentiment, topics, intent, NLI | NER, PoS, entity extraction |
| Granularity | Global | Fine-grained |
| Output format | Single label or distribution | List of labeled spans |
| Typical use case | Email filtering, toxicity detection | Invoice parsing, medical text extraction |
| Can they be combined? | Yes | Yes |
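
The hybrid pattern described above can be sketched by chaining two pipelines; the zero-shot router and NER checkpoint below reuse models introduced earlier in this lesson, and the document string is purely illustrative.

from transformers import pipeline

router = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
ner = pipeline("token-classification", model="dslim/bert-base-NER", aggregation_strategy="simple")

document = "Invoice from Acme Corp: Jane Smith approved a payment of 4,500 USD in Berlin."
doc_type = router(document, ["invoice", "support ticket", "news article"])["labels"][0]

# Only run the token-level extraction step for document types that need structured fields
if doc_type == "invoice":
    print("Extracted entities:", ner(document))
Combining text and token classification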

Selecting the right models

Selecting a model requires balancing domain alignment, language coverage, latency, and compute budget.

General-purpose top performers in 2025 include DeBERTa-v3 and RoBERTa-large for accuracy, DistilBERT for speed, and XLM-R for multilingual coverage. However, domain-specific checkpoints (finance, legal, biomedical) can dramatically outperform general models on niche text.

| Task Type | Recommended Models | Notes |
| --- | --- | --- |
| General classification | microsoft/deberta-v3-large, roberta-large | Highest accuracy |
| Sentiment analysis | distilbert-base-uncased-finetuned-sst-2-english | Fast and lightweight |
| NLI / Zero-shot | roberta-large-mnli, facebook/bart-large-mnli, microsoft/deberta-v3-large | Best reasoning performance |
| NER | dbmdz/bert-large-cased-finetuned-conll03-english, microsoft/deberta-v3-base | Strong span accuracy |
| Multilingual | xlm-roberta-large | Robust for non-English text |
| Low-latency apps | distilbert-base-uncased, google/electra-small-discriminator | Ideal for real-time inference |

Note: DistilBERT keeps around 97% of BERT's performance while being about 40% smaller, an excellent trade-off for production.

Quiz: Should we use domain-specific models?


Try it yourself

You now have all the necessary code inside the accompanying Jupyter notebook. Run each cell step by step and observe what happens. As you execute the notebook, you will see how each pipeline behaves on real text.

Use this Answer to get the access token for Hugging Face. Add the token value in the first cell of the Jupyter notebook and then run all cells.


Summary

This lesson introduced the core ideas behind text and token classification with hands-on practice.

Running these examples equips you with practical experience across the most important Hugging Face classification workflows. By experimenting with different models, analyzing predictions, and evaluating performance on your own data, you’ll deepen both intuition and technical understanding. As you explore tokenization behavior and confidence scores, you’ll begin to recognize why certain errors occur and how to fix them.

This hands-on practice sets the foundation for more advanced NLP experimentation ahead.