The Dataset and Exploratory Data Analysis

Learn how to do an exploratory data analysis with the breast cancer dataset.

We have learned two models for classification: logistic regression and KNN. According to the no free lunch theorem, no single model is best for every problem, so we must compare them to find the best model for our data.

The breast cancer dataset

Most of the time, benign tumors are not dangerous since they can’t spread throughout the body (benign brain tumors, however, can be life-threatening). They can’t invade neighboring tissue and can be removed with a low risk of growing back. However, benign tumors can have other possible adverse health effects, and through the process of tumor progression, many of their types can turn malignant (cancerous).

Breast cancer is one of the most common cancers in women. The original breast cancer dataset has 569 observations and 30 features (all numeric). The target classes are M (malignant) and B (benign), and the class distribution is 212 malignant (represented by 0) and 357 benign (represented by 1).
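As a quick check of these numbers, the sketch below loads the copy of the dataset that ships with scikit-learn. That the lesson uses this particular copy is an assumption; the 0/1 class encoding described above matches it.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Assumption: the lesson's data matches scikit-learn's bundled copy.
data = load_breast_cancer()

n_samples, n_features = data.data.shape

# np.bincount counts how often each integer label occurs:
# index 0 holds the malignant count, index 1 the benign count.
class_counts = np.bincount(data.target)

print("observations:", n_samples, "features:", n_features)
for name, count in zip(data.target_names, class_counts):
    print(str(name), int(count))
```

The classes are imbalanced (more benign than malignant cases), which is worth keeping in mind when the two models are compared later.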

In the dataset given below, there are 10 real-valued features that are computed for each cell nucleus:

  • Radius: Mean of distances from the center to points on the perimeter.

  • Texture: Standard deviation of grayscale values.

  • Perimeter: Total length of a shape’s boundary.

  • Area: Total space taken by a shape.

  • Smoothness: Local variation in radius lengths.

  • Compactness: $\text{perimeter}^2 / \text{area} - 1.0$.

  • Concavity: Severity of concave portions of the contour.

  • Concave points: Number of concave portions of the contour.

  • Symmetry: An object is said to have symmetry if it can be divided into two identical halves.

  • Fractal dimension: $\text{coastline approximation} - 1$.
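To make a few of these definitions concrete, here is a minimal sketch that computes radius, perimeter, area, and compactness for a simple polygon contour. It is an illustration only, not the procedure used to build the dataset: the helper `contour_features` is hypothetical, the center is assumed to be the mean of the vertices, and the compactness formula is read with ordinary operator precedence, that is, (perimeter² / area) − 1.0.

```python
import math

def contour_features(points):
    """Shape features for a closed polygon given as a list of (x, y) vertices."""
    n = len(points)

    # Assumption for this sketch: the center is the mean of the vertices.
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n

    # Radius: mean distance from the center to the points on the perimeter.
    radius = sum(math.hypot(x - cx, y - cy) for x, y in points) / n

    perimeter = 0.0
    twice_area = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the contour
        perimeter += math.hypot(x2 - x1, y2 - y1)
        twice_area += x1 * y2 - x2 * y1  # shoelace formula term
    area = abs(twice_area) / 2.0

    compactness = perimeter ** 2 / area - 1.0
    return {
        "radius": radius,
        "perimeter": perimeter,
        "area": area,
        "compactness": compactness,
    }

# A 2 x 2 square as a toy contour.
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
features = contour_features(square)
print(features)
```

For the square, the perimeter is 8 and the area is 4, so the compactness under this reading is 8² / 4 − 1 = 15. The values stored in the dataset are on a different scale, so this sketch should not be used to reproduce them.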

The mean, the standard error, and the "worst" value (the mean of the three largest values) of each of the above features were computed for each image, resulting in 30 features. For example, counting from 0, field 0 is the mean radius, field 10 is the radius standard error, and field 20 is the worst radius (please see the data columns and descriptions above). Let’s find our best model for the breast cancer data, be it KNN or logistic regression.
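The field layout can be sketched in a few lines: 10 base features times 3 statistics gives 30 columns, ordered statistic by statistic. The labels built here are illustrative and are not claimed to be the dataset's actual column names.

```python
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]
statistics = ["mean", "standard error", "worst"]

# All 10 means come first, then the 10 standard errors, then the 10 worst values.
columns = [
    f"{feature} ({stat})"
    for stat in statistics
    for feature in base_features
]

for index in (0, 10, 20):
    print(index, columns[index])
```

With this ordering, the field for a given feature and statistic sits at `10 * statistics.index(stat) + base_features.index(feature)`.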

Basic imports

Let's start with importing the essential libraries.
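The lesson's own import cell is not reproduced here. The following is a minimal sketch of a typical setup, assuming NumPy, pandas, Matplotlib, and scikit-learn are the libraries in use and that the data comes from scikit-learn's bundled copy.

```python
import numpy as np                # numerical arrays
import pandas as pd               # tabular data handling
import matplotlib.pyplot as plt   # plotting for the exploratory analysis
from sklearn.datasets import load_breast_cancer

# as_frame=True returns pandas objects (available in scikit-learn 0.23 and later).
cancer = load_breast_cancer(as_frame=True)
df = cancer.frame  # 30 feature columns plus a "target" column

print(df.shape)
print(df.head())
```

Having the data in a single DataFrame makes the usual first steps of an exploratory analysis, such as `df.describe()` and `df["target"].value_counts()`, one-liners.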
