Scikit-learn is a machine learning library for Python. It provides widely used machine learning algorithms for solving clustering, classification, and regression problems.
To use Scikit-learn, we need to import the library, whose package name is abbreviated as sklearn. This is shown below:
import sklearn
Like some other Python libraries, Scikit-learn comes with a set of built-in datasets. You need to import the datasets module first in order to access the dataset of your choice. The syntax used to import the module is:
from sklearn import datasets
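To see which built-in datasets are available, one option is to list the loader functions exposed by the datasets module. The snippet below is a minimal sketch that assumes the bundled loaders follow the load_ naming prefix; the variable name loaders is only illustrative.

```python
from sklearn import datasets

# Collect the names of the bundled-dataset loaders (functions prefixed with "load_")
loaders = [name for name in dir(datasets) if name.startswith("load_")]

for name in loaders:
    print(name)
```

The exact list depends on the installed Scikit-learn version, so no expected output is shown here.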
If you already know which dataset you would like to use from the package, you can import its loader directly. In the following example, we will import the diabetes dataset. This dataset contains data from diabetic patients, with features such as age, BMI, blood pressure, and blood sugar level, which are useful for predicting the progression of the disease in patients.
from sklearn.datasets import load_diabetes  # import the loader for the diabetes patients dataset
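Importing the loader does not load any data by itself; the data is read when the function is called. As a rough sketch, calling load_diabetes() with no arguments returns a dictionary-like object whose attributes can be inspected. The variable name diabetes is chosen here only for illustration.

```python
from sklearn.datasets import load_diabetes

# Call the loader; with default arguments it returns a dictionary-like container
diabetes = load_diabetes()

print(diabetes.feature_names)  # names of the feature columns
print(diabetes.data.shape)     # (number of patients, number of features)
print(diabetes.target.shape)   # one disease-progression value per patient
```

Printing diabetes.DESCR gives the full description of the dataset that ships with the library.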
To load the diabetes data as NumPy arrays, set the return_X_y parameter to True. The loader then returns the features and the target as two separate arrays.
from sklearn import datasets
# load the features and the target as NumPy arrays
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
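A quick way to confirm what the call returned is to check the type and shape of each array. This is a small sketch that reuses the variable names from the example above.

```python
import numpy as np
from sklearn import datasets

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# Both objects should be plain NumPy arrays
print(isinstance(diabetes_X, np.ndarray), diabetes_X.shape)  # feature matrix
print(isinstance(diabetes_y, np.ndarray), diabetes_y.shape)  # target vector
```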
To load the features (X) as a pandas DataFrame and the target (y) as a pandas Series, also set the as_frame parameter to True.
from sklearn import datasets
# X is returned as a DataFrame and y as a Series
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True, as_frame=True)
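With as_frame=True the feature columns carry their names, which makes the data easier to explore. The following sketch assumes pandas is installed alongside Scikit-learn and reuses the variable names from the example above.

```python
import pandas as pd
from sklearn import datasets

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True, as_frame=True)

# Confirm the returned types
print(isinstance(diabetes_X, pd.DataFrame), isinstance(diabetes_y, pd.Series))

print(list(diabetes_X.columns))  # feature names as column labels
print(diabetes_X.head())         # first five rows of the features
print(diabetes_y.describe())     # summary statistics of the target
```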
This functionality is not available in sklearn version 0.22 and older, so if you run into an error such as "unexpected keyword argument 'as_frame'", upgrade your sklearn library by running !pip install scikit-learn==0.23 in a Jupyter notebook, or pip3 install --upgrade scikit-learn in a terminal.
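Before upgrading, it can help to check which version is currently installed. The snippet below is a minimal sketch that reads sklearn.__version__ and compares its major and minor components against 0.23; the variable name supports_as_frame is hypothetical and used only for illustration.

```python
import re

import sklearn

# Extract the leading "major.minor" part of the version string, e.g. "0.23.2" -> (0, 23)
match = re.match(r"(\d+)\.(\d+)", sklearn.__version__)
major, minor = int(match.group(1)), int(match.group(2))

# as_frame is expected to work from version 0.23 onwards
supports_as_frame = (major, minor) >= (0, 23)

print("Installed version:", sklearn.__version__)
print("as_frame supported:", supports_as_frame)
```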
Have fun exploring the diabetes dataset in the Scikit-learn library!