Exploratory Data Analysis
Explore how to conduct exploratory data analysis on numeric and categorical variables using histograms, bar charts, scatter plots, and hue mapping. Understand data distribution and relationships essential for preparing regression models with PyCaret.
We’ll now perform EDA on our data. As mentioned earlier, EDA helps us understand the properties of a dataset through descriptive statistics and visualization. It is an important part of every machine learning or data science project because we need to understand the dataset before we use it.
Histogram of numeric variables
The distribution of numeric variables can be visualized with histograms, which are easy to create with the hist() method of a pandas DataFrame.
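A minimal sketch of this step follows. The lesson's dataset is not reproduced here, so the DataFrame below is a synthetic stand-in: the name `data`, the `age` and `bmi` columns, and all values are assumptions made for illustration, and `charges` is drawn from a log-normal distribution so that it is right-skewed by construction.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the lesson's dataset (synthetic values only).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "age": rng.integers(18, 65, size=200),
    "bmi": rng.normal(30, 6, size=200),
    # Log-normal draws are right-skewed by construction.
    "charges": rng.lognormal(mean=9, sigma=0.8, size=200),
})

# DataFrame.hist() draws one histogram per numeric column
# and returns an array of matplotlib Axes.
axes = data.hist(bins=20, figsize=(10, 6))

# Skewness gives a numeric check alongside the plot:
# positive values indicate a longer right tail.
skewness = data.skew()
```

In a notebook the figure renders automatically; in a script, call `matplotlib.pyplot.show()` afterwards.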
As we can see in the output, some of the variables have right-skewed distributions that may cause problems with regression models, so we’ll have to deal with that later.
Bar charts of categorical variables
Bar charts are the standard way to plot categorical variables. We can create them easily by chaining the pandas value_counts() and plot() methods.
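The sketch below shows one way to do this for the three categorical variables named in this lesson. The tiny DataFrame is a hypothetical placeholder, not the lesson's data, and the category labels are assumed for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical placeholder rows; the real dataset is much larger.
data = pd.DataFrame({
    "smoker": ["no", "no", "yes", "no", "no", "yes", "no", "no"],
    "sex": ["female", "male", "male", "female", "male", "female", "female", "male"],
    "region": ["north", "south", "east", "west", "north", "south", "east", "west"],
})

categorical = ["smoker", "sex", "region"]
fig, axes = plt.subplots(1, len(categorical), figsize=(12, 4))

for ax, column in zip(axes, categorical):
    # value_counts() tallies each category; plot(kind="bar") draws the tallies.
    data[column].value_counts().plot(kind="bar", ax=ax, title=column)

# normalize=True returns proportions instead of raw counts,
# which is handy for quoting the share of each category.
smoker_share = data["smoker"].value_counts(normalize=True)
```

Multiplying `smoker_share` by 100 gives percentages for each category of the placeholder data.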
As we can see in the output, the smoker variable is unevenly distributed, with smokers making up only a minority of the people in the dataset. The sex and region variables, on the other hand, are roughly evenly distributed.
Numeric and categorical features
The histplot() Seaborn function lets us visualize the relationship between numeric and categorical variables using hue mapping.
In this case, we plot the histogram of the target variable, colored differently for every category of the smoker, sex, and region variables. Smokers incur significantly higher charges than non-smokers. This is expected because the health risks associated with smoking are numerous and well-documented.
Scatter plots
Scatter plots are a type of visualization that helps us understand the relationship between numeric variables. The pairplot() Seaborn function creates a grid of scatter plots for all pairs of numeric variables in a given dataset.
The diagonal contains distribution plots of the variables, such as histograms or kernel density estimation (