Building a diabetes model pipeline

We will use a diabetes dataset and build an end-to-end machine learning pipeline for a regression problem.

The data is a subset of the standard diabetes dataset available in Azure Open Datasets.

The dataset schema is as follows:

AGE - age in years (Integer)

SEX - sex (1/2)

BMI - body mass index (Float)

BP - average blood pressure (Float)

S1 - tc, total serum cholesterol (Integer)

S2 - ldl, low-density lipoproteins (Float)

S3 - hdl, high-density lipoproteins (Float)

S4 - tch, total cholesterol / HDL (Float)

S5 - ltg, possibly log of serum triglycerides level (Float)

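S6 - glu, blood sugar level (Integer)

Y - target, a quantitative measure of disease progression one year after baseline (Integer)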

Preview of the dataset:

AGE	SEX	BMI	BP	S1	S2	S3	S4	S5	S6	Y
59	2	32.1	101	157	93.2	38	4	4.86	87	151
48	1	21.6	87	183	103.2	70	3	3.89	69	75

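The goal is to predict the patient's diabetes score, which is the last value in each preview row. Because the score is a continuous number rather than a category, this is a classic regression problem.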
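To make the schema and the prediction target concrete, here is a minimal sketch in plain Python that parses the two preview rows and separates the ten input features from the score to be predicted. The column name Y for the target and the helper names parse_rows and split_features_target are assumptions made for illustration; a real pipeline would load the full dataset rather than hard-coded rows.

```python
import csv
import io

# Column names follow the schema above. "Y" is an assumed name for the
# target column, i.e. the last value in each preview row.
COLUMNS = ["AGE", "SEX", "BMI", "BP", "S1", "S2", "S3", "S4", "S5", "S6", "Y"]

# The two preview rows, tab-separated as in the preview.
PREVIEW = (
    "59\t2\t32.1\t101\t157\t93.2\t38\t4\t4.86\t87\t151\n"
    "48\t1\t21.6\t87\t183\t103.2\t70\t3\t3.89\t69\t75\n"
)


def parse_rows(text):
    """Parse tab-separated rows into dicts of column name -> float."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [dict(zip(COLUMNS, map(float, row))) for row in reader if row]


def split_features_target(rows, target="Y"):
    """Separate each row into a feature dict and its target value."""
    features = [{k: v for k, v in row.items() if k != target} for row in rows]
    targets = [row[target] for row in rows]
    return features, targets


rows = parse_rows(PREVIEW)
X, y = split_features_target(rows)
print(len(X[0]), "features per row")
print("targets:", y)
```

Every value is read as a float here to keep the sketch short; the integer columns in the schema could be cast back to int if exact types matter.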

The problem is broken down into the following tasks: