
What is Data Block API in fastai?


If you have used any deep learning framework (I use PyTorch) to build a model, you will have gone through the usual steps: collecting data, identifying the type of problem (e.g., image classification, segmentation, etc.), determining the dependent and independent variables, splitting the data into training and validation sets, and applying transforms to improve accuracy.

And in that process, you may also have written lengthy code. However, what if I told you that you could do all of that in one single block?

Data Block API

The data block API is a high-level, expressive API in fastai for data loading. It is a way to systematically define all of the steps necessary to prepare data for a deep learning model, and it gives users a "mix and match" recipe book (we refer to these reusable pieces as data blocks) to use when combining those steps.

Think of the DataBlock as a list of instructions to follow when building batches and DataLoaders: it doesn't require any items up front; instead, it describes how to operate on them. In other words, writing a DataBlock is just like writing a blueprint.

Now, we just saw the word DataLoaders, but what is that? Well, PyTorch and fastai use two main classes to represent and access a training or validation set:

  • Dataset: A collection that returns a tuple of your independent and dependent variables for a single item.

  • DataLoader: An iterator that provides a stream of mini-batches, where each mini-batch is a pair of a batch of independent variables and a batch of dependent variables.

Interestingly enough, fastai provides two classes for you to bring your training and validation sets together:

Datasets: An object that contains a training Dataset and a validation Dataset.

DataLoaders: An object that contains a training DataLoader and a validation DataLoader.
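To make the Dataset/DataLoader distinction concrete, here is a minimal pure-Python sketch. These are not the actual PyTorch or fastai classes (the real ones also handle shuffling, workers, and tensor collation); the names are illustrative:

```python
# Conceptual sketch only -- not the real torch.utils.data classes.

class ToyDataset:
    """Returns one (independent, dependent) tuple per index."""
    def __init__(self, xs, ys):
        self.xs, self.ys = xs, ys

    def __len__(self):
        return len(self.xs)

    def __getitem__(self, i):
        return self.xs[i], self.ys[i]

def toy_dataloader(dataset, batch_size):
    """Yields mini-batches: a batch of inputs paired with a batch of labels."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        items = [dataset[i] for i in range(start, stop)]
        xb, yb = zip(*items)  # collate item tuples into two batches
        yield list(xb), list(yb)

ds = ToyDataset(["img0", "img1", "img2", "img3", "img4"], [0, 1, 0, 1, 0])
batches = list(toy_dataloader(ds, batch_size=2))
# first mini-batch: (["img0", "img1"], [0, 1])
```

A fastai `Datasets`/`DataLoaders` object simply holds one of these for training and one for validation.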

The fastai library has an easy way of building DataLoaders so that it is simple enough for someone with minimal coding knowledge to understand, yet advanced enough to allow for exploration.

Steps

There are several steps to follow in order to create data blocks.

The steps are defined by the data block API and can be phrased as questions to ask while looking at the data:

  1. What are the types of your inputs/targets? (blocks)
  2. Where is your data? (get_items)
  3. Does something need to be applied to inputs? (get_x)
  4. Does something need to be applied to the target? (get_y)
  5. How to split the data? (splitter)
  6. Do we need to apply something on formed items? (item_tfms)
  7. Do we need to apply something on formed batches? (batch_tfms)

That's it!

You can treat each question or step as a brick that builds the fastai data block:

  • Blocks
  • get_items
  • get_x/get_y
  • splitter
  • item_tfms
  • batch_tfms

Looking at the dataset is very important while building DataLoaders, and the data block API is the strategy for solving such problems. The first thing to look at is how the data is stored (in what format and layout), then compare it with well-known datasets stored the same way to decide how to approach it.

Here, blocks define the problem domain. For example, if it's an image problem, I can tell the library to use Pillow without saying so explicitly, or specify whether it is single-label or multi-label classification. There are many blocks, such as ImageBlock, CategoryBlock, MultiCategoryBlock, MaskBlock, PointBlock, BBoxBlock, BBoxLblBlock, TextBlock, and so on.

get_items answers the question, "where is the data?"

For example, in an image problem, we can use the get_image_files function to grab all the file locations of our images and take a look at the data.
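get_image_files itself comes from fastai, but conceptually it just walks a folder tree and collects image paths. Here is a hedged pure-Python stand-in using pathlib (the real function handles more extensions and options; the helper name here is made up):

```python
from pathlib import Path
import tempfile

# Simplified stand-in for fastai's get_image_files: recursively
# collect files whose suffix looks like an image extension.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp"}

def get_image_files_sketch(root):
    return sorted(p for p in Path(root).rglob("*")
                  if p.suffix.lower() in IMAGE_EXTS)

# Demo on a throwaway directory with one image and one non-image file
root = Path(tempfile.mkdtemp())
(root / "cat_1.jpg").touch()
(root / "notes.txt").touch()
files = get_image_files_sketch(root)
# files contains only the .jpg path
```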

get_x answers, "does something need to be applied to the inputs?"

get_y is how you extract the labels.
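In the pets dataset used later in this answer, the label is encoded in the filename itself (e.g., "great_pyrenees_173.jpg" labels the class "great_pyrenees"). A small sketch of what a RegexLabeller-style get_y does, using Python's re module directly:

```python
import re

# Pets-style filenames encode the class before a trailing "_<number>.jpg".
# This mirrors what RegexLabeller(pat=r'^(.*)_\d+.jpg$') extracts.
def label_from_filename(name, pat=r'^(.*)_\d+.jpg$'):
    match = re.match(pat, name)
    if match is None:
        raise ValueError(f"filename {name!r} does not match pattern")
    return match.group(1)

label = label_from_filename("great_pyrenees_173.jpg")
# label == "great_pyrenees"
```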

splitter is how you want to split your data. This is usually a random split between the training and validation datasets.
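Conceptually, a random splitter shuffles the item indices and reserves a fraction for validation. A pure-Python sketch of the idea behind fastai's RandomSplitter (the function name here is illustrative; the real class also integrates with the DataBlock machinery):

```python
import random

# Mirrors the idea behind RandomSplitter(valid_pct=0.2, seed=...):
# shuffle item indices, then reserve a fraction for validation.
def random_splitter_sketch(n_items, valid_pct=0.2, seed=42):
    idxs = list(range(n_items))
    random.Random(seed).shuffle(idxs)
    cut = int(n_items * valid_pct)
    return idxs[cut:], idxs[:cut]  # (train indices, valid indices)

train_idx, valid_idx = random_splitter_sketch(10, valid_pct=0.2)
# 8 training indices and 2 validation indices, with no overlap
```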

The remaining two bricks of data block API are item_tfms and batch_tfms:

item_tfms are item transforms, applied to each individual item. This is done on the CPU.

batch_tfms are batch transforms, applied to batches of data. This is done on the GPU.
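The ordering matters: item transforms run per item (so every item ends up the same size and can be collated into a batch), and batch transforms then run once per mini-batch. A pure-Python sketch of that two-stage pipeline (the transform bodies are placeholders standing in for things like Resize(128) and aug_transforms()):

```python
# Illustrative pipeline: item_tfm runs per item (CPU in fastai),
# batch_tfm runs per mini-batch (GPU in fastai).
def item_tfm(item):
    # stands in for e.g. Resize(128): make every item a uniform size
    return f"resized({item})"

def batch_tfm(batch):
    # stands in for e.g. aug_transforms(): augment the whole batch at once
    return [f"augmented({x})" for x in batch]

items = ["img0", "img1", "img2", "img3"]
batch_size = 2
batches = []
for start in range(0, len(items), batch_size):
    batch = [item_tfm(x) for x in items[start:start + batch_size]]  # per item
    batches.append(batch_tfm(batch))                                # per batch
# batches[0] == ["augmented(resized(img0))", "augmented(resized(img1))"]
```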

Using these bricks in the data block, we can build data loaders that are ready for different types of problems, like classification, object detection, segmentation, etc.

The data block API provides a good balance of conciseness and expressiveness. In the data science domain, the scikit-learn pipeline approach is widely used. That API provides a very high level of expressiveness, but it is not opinionated enough to ensure that a user completes all of the steps necessary to get their data ready for modeling. The fastai data block API takes care of all of this.

Now that we have seen what the data block API is, let's wrap up by building one.

It’s time! Let’s see the code (data block) for a single-label classification of the Oxford IIIT pets dataset:

# assumes: from fastai.vision.all import *
pets = DataBlock(blocks=(ImageBlock, CategoryBlock),
                 get_items=get_image_files,
                 splitter=RandomSplitter(),
                 get_y=Pipeline([attrgetter("name"),
                                 RegexLabeller(pat=r'^(.*)_\d+.jpg$')]),
                 item_tfms=Resize(128),
                 batch_tfms=aug_transforms())

Curious to know what's in this code and how to write it yourself? Take a look at part 2.


CONTRIBUTOR

Kiran U Kamath
Copyright ©2024 Educative, Inc. All rights reserved