Building a diabetes model pipeline
We will use a diabetes dataset and build an end-to-end machine learning pipeline for a regression problem.
The dataset is a subset of the standard diabetes dataset available in Azure Open Datasets.
The dataset schema follows:
AGE - age in years (Integer)
SEX - sex (1/2)
BMI - body mass index (Float)
BP - average blood pressure (Float)
S1 - tc, total serum cholesterol (Integer)
S2 - ldl, low-density lipoproteins (Float)
S3 - hdl, high-density lipoproteins (Float)
S4 - tch, total cholesterol / HDL (Float)
S5 - ltg, possibly log of serum triglycerides level (Float)
S6 - glu, blood sugar level (Integer)
Y - target, the diabetes score to predict (Integer)
Preview of the dataset:
| AGE | SEX | BMI | BP | S1 | S2 | S3 | S4 | S5 | S6 | Y |
|---|---|---|---|---|---|---|---|---|---|---|
| 59 | 2 | 32.1 | 101 | 157 | 93.2 | 38 | 4 | 4.86 | 87 | 151 |
| 48 | 1 | 21.6 | 87 | 183 | 103.2 | 70 | 3 | 3.89 | 69 | 75 |
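To make the schema concrete, the two preview rows can be parsed into typed records and split into features and target. This is a minimal sketch using only the Python standard library, not the loading code used by the pipeline itself. The names `SCHEMA`, `PREVIEW_ROWS`, and `parse_row` are hypothetical helpers introduced for illustration, and `Y` is an assumed name for the eleventh preview column.

```python
# Column names and Python types, following the schema listed above.
# "Y" is an assumed name for the eleventh preview column (the target).
SCHEMA = [
    ("AGE", int), ("SEX", int), ("BMI", float), ("BP", float),
    ("S1", int), ("S2", float), ("S3", float), ("S4", float),
    ("S5", float),  # parsed as float: the preview values 4.86 and 3.89 are fractional
    ("S6", int), ("Y", int),
]

# The two rows shown in the preview, copied verbatim.
PREVIEW_ROWS = [
    "| 59 | 2 | 32.1 | 101 | 157 | 93.2 | 38 | 4 | 4.86 | 87 | 151 |",
    "| 48 | 1 | 21.6 | 87 | 183 | 103.2 | 70 | 3 | 3.89 | 69 | 75 |",
]


def parse_row(line):
    """Turn one pipe-delimited preview row into a dict of typed values."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    if len(cells) != len(SCHEMA):
        raise ValueError(f"expected {len(SCHEMA)} cells, got {len(cells)}")
    return {name: cast(cell) for (name, cast), cell in zip(SCHEMA, cells)}


records = [parse_row(row) for row in PREVIEW_ROWS]

# Separate the ten input columns from the value to be predicted.
features = [{k: v for k, v in rec.items() if k != "Y"} for rec in records]
targets = [rec["Y"] for rec in records]

print(features[0])
print(targets)
```

Keeping the column names and types in one list means a row with a missing or malformed cell fails early with a clear error rather than silently shifting values into the wrong column.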
The goal is to predict each patient's diabetes score, the last column in the preview. Because the target is a numeric value rather than a category, this is a classic regression problem.
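As a rough illustration of what a regression setup implies, a model's numeric predictions are compared with the true scores using an error measure. The sketch below scores a trivial baseline that always predicts the mean target, using only the two target values from the preview. The choice of root mean squared error (RMSE) and the helper name `rmse` are assumptions for illustration; the source does not state which metric the pipeline uses.

```python
import math
from statistics import mean


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(mean(squared_errors))


# Target values taken from the last column of the two preview rows.
y_true = [151, 75]

# Trivial baseline: predict the mean target for every patient.
baseline_prediction = mean(y_true)
y_pred = [baseline_prediction] * len(y_true)

print(baseline_prediction)   # 113
print(rmse(y_true, y_pred))  # 38.0
```

Two rows are far too few to say anything about model quality; the point is only that a trained regression model should achieve a lower error than a constant baseline of this kind on held-out data.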
The problem is broken down into the following tasks: