Getting to Know the Dataset

Let's have a look at the dataset we'll be using and try to understand it.

Previously, we worked on well-known real-world datasets for text classification and entity extraction. Exploring the dataset is always our very first task: the main point of data exploration is to understand the nature of the text so that we can develop strategies in our algorithms for tackling it. We learned earlier that the following are the main points to keep an eye on during our exploration:

  • What kind of utterances are there? Are utterances short texts, full sentences, long paragraphs, or whole documents? What is the average utterance length? (A quick way to compute this is sketched after this list.)

  • What sort of entities does the corpus include? Person names, organization names, geographical locations, street names? Which ones do we want to extract?

  • How is punctuation used? Is the text correctly punctuated, or is no punctuation used at all?

  • How well are grammatical rules followed? Is capitalization correct? Are there misspelled words?
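
A quick script often answers the length question. The following is a minimal sketch of computing basic length statistics; the `utterances` list here is hypothetical stand-in data, and in practice you would gather the utterances from the dataset itself:

```python
# Hypothetical stand-in utterances; in practice, collect these from the dataset.
utterances = [
    "I'd like to book a table for two.",
    "Sure, which restaurant and what time?",
    "Somewhere downtown, around 7 pm.",
]

# Length in characters and in whitespace-separated tokens.
char_lengths = [len(u) for u in utterances]
token_lengths = [len(u.split()) for u in utterances]

print(f"Average length: {sum(char_lengths) / len(char_lengths):.1f} characters")
print(f"Average length: {sum(token_lengths) / len(token_lengths):.1f} tokens")
```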

The previous datasets we used consisted of (text, class_label) pairs for text classification tasks or (text, list_of_entities) pairs for entity extraction tasks. Now we'll tackle a much more complicated task: chatbot design. Hence, the dataset will be more structured and more complex.

Chatbot design datasets are usually in JSON format to preserve the dataset's structure (a sketch of such a record follows this list). Here, structure means the following:

  • Keeping the order of user and system utterances

  • Marking the slots in user utterances

  • Labeling the intent of the user utterances
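
To make this concrete, the snippet below sketches what a single annotated dialogue record might look like as a Python dictionary. The field names (`dialogue_id`, `turns`, `intent`, `slots`, and so on) are illustrative assumptions, not the exact SGD schema; the point is how turn order, slot annotations, and intent labels can all live in one JSON object:

```python
# A simplified, hypothetical dialogue record; field names are illustrative
# and do not exactly match the SGD schema.
dialogue = {
    "dialogue_id": "example-001",
    "turns": [
        {   # Turn order is preserved by the list position.
            "speaker": "USER",
            "utterance": "Book a table at Luigi's for 7 pm.",
            "intent": "ReserveRestaurant",      # intent label
            "slots": [                          # slot annotations
                {"slot": "restaurant_name", "value": "Luigi's"},
                {"slot": "time", "value": "7 pm"},
            ],
        },
        {
            "speaker": "SYSTEM",
            "utterance": "Sure, for how many people?",
        },
    ],
}
```

Because JSON maps directly onto nested lists and dictionaries, the turn order survives serialization for free, and the slot and intent annotations stay attached to the utterance they describe.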

Throughout this chapter, we'll use Google Research's Schema-Guided Dialogue (SGD) dataset. This dataset consists of annotated user-virtual assistant interactions. The original dataset contains over 20,000 dialogues in several domains, including restaurant reservations, movie ticket purchases, weather queries, and travel booking. Each dialogue includes the user's and the virtual assistant's utterances turn by turn. We won't use all of this massive dataset; instead, we'll use a subset about restaurant reservations.

Let's get started with loading the dataset.
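
As a minimal sketch, assuming you've downloaded the SGD release from GitHub, whose splits ship as JSON files (such as train/dialogues_001.json) that each contain a list of dialogue records, loading one file could look like this; the file path is an assumption and should be adapted to wherever you stored the data:

```python
import json

# Assumed path: adapt it to wherever you downloaded the SGD files.
DATA_FILE = "train/dialogues_001.json"

with open(DATA_FILE, encoding="utf-8") as f:
    dialogues = json.load(f)  # each file holds a list of dialogue records

print(f"Loaded {len(dialogues)} dialogues")

# Peek at the first few turns of the first dialogue.
for turn in dialogues[0]["turns"][:4]:
    print(turn["speaker"], "-", turn["utterance"])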
