Building a diabetes model pipeline

We will use a diabetes dataset and build an end-to-end machine learning pipeline for a regression problem.

The diabetes dataset is a subset of the standard diabetes dataset available in Azure Open Datasets.

The dataset schema follows:

AGE - age in years (Integer)

SEX - sex (1/2)

BMI - body mass index (Float)

BP - average blood pressure (Float)

S1 - tc, total serum cholesterol (Integer)

S2 - ldl, low-density lipoproteins (Float)

S3 - hdl, high-density lipoproteins (Float)

S4 - tch, total cholesterol / HDL (Float)

S5 - ltg, possibly log of serum triglycerides level (Float)

S6 - glu, blood sugar level (Integer)

Preview of the dataset:

59 2 32.1 101 157 93.2 38 4 4.86 87 151
48 1 21.6 87 183 103.2 70 3 3.89 69 75
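
To show how the preview rows line up with the schema, here is a minimal sketch that parses them using only the Python standard library. The column name Y for the last value (the score to predict) is an assumption, since the schema above does not name that column, and parse_preview is a hypothetical helper, not part of any Azure SDK.

```python
# Column names follow the schema above; "Y" is an assumed name for the
# final value in each row (the score we want to predict).
COLUMNS = ["AGE", "SEX", "BMI", "BP", "S1", "S2", "S3", "S4", "S5", "S6", "Y"]

# The two preview rows, copied verbatim.
PREVIEW = """\
59 2 32.1 101 157 93.2 38 4 4.86 87 151
48 1 21.6 87 183 103.2 70 3 3.89 69 75
"""


def parse_preview(text):
    """Turn whitespace-separated rows into a list of {column: float} dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} values, got {len(fields)}")
        rows.append({name: float(value) for name, value in zip(COLUMNS, fields)})
    return rows


rows = parse_preview(PREVIEW)
for row in rows:
    print(row["AGE"], row["BMI"], row["S5"], row["Y"])
```

Every value is read as a float here to keep the sketch short; a real loader would apply the per-column types listed in the schema.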

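The goal is to predict each patient's diabetes score, which is the last value in each preview row (151 and 75 above). Because the target is a continuous number rather than a category, this is a classic regression problem.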

The problem is broken down into the following tasks:

  1. Set the proper environment variables.
  2. Analyze the dataset.
  3. Clean the dataset.
  4. Train a model on the cleaned dataset.
  5. Create an Azure Machine Learning pipeline by connecting the Clean and Train modules.
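
The way the Clean and Train steps connect can be sketched locally in plain Python. This is only an illustration of the data flow, not the Azure pipeline itself: the function names clean, train, and run_pipeline are hypothetical, the column name Y for the score is an assumption, and the one-feature least-squares fit on BMI stands in for whatever model the Train module actually builds.

```python
def clean(rows):
    """Clean step: keep only rows in which every value is present."""
    return [row for row in rows if all(value is not None for value in row.values())]


def train(rows, feature="BMI", target="Y"):
    """Train step: fit target = slope * feature + intercept by least squares."""
    if len(rows) < 2:
        raise ValueError("need at least two rows to fit a line")
    xs = [row[feature] for row in rows]
    ys = [row[target] for row in rows]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("feature values are all identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def run_pipeline(rows):
    """Connect the two steps: the output of clean() is the input of train()."""
    return train(clean(rows))


raw_rows = [
    {"BMI": 32.1, "Y": 151.0},  # from the dataset preview
    {"BMI": 21.6, "Y": 75.0},   # from the dataset preview
    {"BMI": None, "Y": 100.0},  # hypothetical incomplete row, for illustration only
]

slope, intercept = run_pipeline(raw_rows)
print(f"predicted score = {slope:.3f} * BMI + {intercept:.3f}")
```

The point of the sketch is the hand-off: Train never sees the incomplete row because Clean has already removed it, which is the same dependency the pipeline expresses when the two modules are connected.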