Getting to Know the Dataset

Let's have a look at the dataset we'll be using and try to understand it.

Previously, we worked on well-known real-world datasets for text classification and entity extraction. Exploring the dataset is always our very first task: the main point of data exploration is to understand the nature of the text so that we can develop strategies in our algorithms for tackling it. We learned earlier that the following are the main points to keep an eye on during our exploration:

  • What kind of utterances are there? Are utterances short texts, full sentences, long paragraphs, or whole documents? What is the average utterance length? (A quick way to compute this is sketched after this list.)

  • What sort of entities does the corpus include? Person names, organization names, geographical locations, street names? Which ones do we want to extract?

  • How is punctuation used? Is the text correctly punctuated, or is no punctuation used at all?

  • How well are grammatical rules followed? Is capitalization correct? Are there misspelled words?
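
A quick script often answers the length question. The following is a minimal sketch of computing basic length statistics; the `utterances` list here is hypothetical stand-in data, and in practice you would gather the utterances from the dataset itself:

```python
# Hypothetical stand-in utterances; in practice, collect these from the dataset.
utterances = [
    "I'd like to book a table for two.",
    "Sure, which restaurant and what time?",
    "Somewhere downtown, around 7 pm.",
]

# Length in characters and in whitespace-separated tokens.
char_lengths = [len(u) for u in utterances]
token_lengths = [len(u.split()) for u in utterances]

print(f"Average length: {sum(char_lengths) / len(char_lengths):.1f} characters")
print(f"Average length: {sum(token_lengths) / len(token_lengths):.1f} tokens")
```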

The previous datasets we used consisted of (text, class_label) pairs for text classification tasks or (text, list_of_entities) pairs for entity extraction tasks. Now we'll tackle a much more complicated task: chatbot design. Hence, the dataset will be more structured and more complex.

Chatbot design datasets are usually in JSON format to preserve the dataset's structure (a sketch of such a record follows this list). Here, structure means the following:

  • Keeping the order of user and system utterances

  • Marking the slots in user utterances

  • Labeling the intent of the user utterances
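
To make this concrete, the snippet below sketches what a single annotated dialogue record might look like as a Python dictionary. The field names (`dialogue_id`, `turns`, `intent`, `slots`, and so on) are illustrative assumptions, not the exact SGD schema; the point is how turn order, slot annotations, and intent labels can all live in one JSON object:

```python
# A simplified, hypothetical dialogue record; field names are illustrative
# and do not exactly match the SGD schema.
dialogue = {
    "dialogue_id": "example-001",
    "turns": [
        {   # Turn order is preserved by the list position.
            "speaker": "USER",
            "utterance": "Book a table at Luigi's for 7 pm.",
            "intent": "ReserveRestaurant",      # intent label
            "slots": [                          # slot annotations
                {"slot": "restaurant_name", "value": "Luigi's"},
                {"slot": "time", "value": "7 pm"},
            ],
        },
        {
            "speaker": "SYSTEM",
            "utterance": "Sure, for how many people?",
        },
    ],
}
```

Because JSON maps directly onto nested lists and dictionaries, the turn order survives serialization for free, and the slot and intent annotations stay attached to the utterance they describe.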

Throughout this chapter, we'll use Google Research's Schema-Guided Dialogue (SGD) dataset. This dataset consists of annotated user-virtual assistant interactions. The original dataset contains over 20,000 dialogues in several domains, including restaurant reservations, movie ticket purchases, weather queries, and travel booking. Each dialogue includes the user's and the virtual assistant's utterances turn by turn. We won't use all of this massive dataset; instead, we'll use a subset about restaurant reservations.

Let's get started with loading the dataset.
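
As a minimal sketch, assuming you've downloaded the SGD release from GitHub, whose splits ship as JSON files (such as train/dialogues_001.json) that each contain a list of dialogue records, loading one file could look like this; the file path is an assumption and should be adapted to wherever you stored the data:

```python
import json

# Assumed path: adapt it to wherever you downloaded the SGD files.
DATA_FILE = "train/dialogues_001.json"

with open(DATA_FILE, encoding="utf-8") as f:
    dialogues = json.load(f)  # each file holds a list of dialogue records

print(f"Loaded {len(dialogues)} dialogues")

# Peek at the first few turns of the first dialogue.
for turn in dialogues[0]["turns"][:4]:
    print(turn["speaker"], "-", turn["utterance"])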
