thedirtybits

**Exploratory Data Analysis (EDA)** is a way to investigate datasets to find preliminary information, gain insights, and uncover underlying patterns in the data. Instead of relying on assumptions, we process the data systematically to gain insights and make informed decisions.

Some advantages of Exploratory Data Analysis include:

- `Improve understanding` of variables by extracting summary statistics such as the mean, minimum, and maximum values.
- `Discover errors`, outliers, and missing values in the data.
- `Identify patterns` by visualizing data in graphs such as box plots, scatter plots, and histograms.

Hence, the main goal is to understand the data better and use tools effectively to gain valuable insights or draw conclusions.

Fisher's Iris dataset is used to demonstrate EDA tasks in the following code blocks.

The resulting data frame contains 150 records with five attributes: `sepal length (cm)`, `sepal width (cm)`, `petal length (cm)`, `petal width (cm)`, and `class` (the flower species, encoded as an integer).

    # Importing libraries
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris

    # Loading data for analysis
    iris_data = load_iris()

    # Creating a dataframe
    iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
    iris_dataframe['class'] = iris_data.target
    print(iris_dataframe.head())
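Since `class` is stored as an integer code, it can help to see which species each code stands for. The following is a minimal sketch, assuming the same `load_iris()` data as above; the `species` column is a hypothetical addition for illustration and is not part of the original walkthrough.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the data frame used in this article
iris_data = load_iris()
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_dataframe['class'] = iris_data.target

# Hypothetical extra column: map each integer code to its species name
code_to_name = dict(enumerate(iris_data.target_names))
iris_dataframe['species'] = iris_dataframe['class'].map(code_to_name)

# Check the size of the data and how many records each species has
print(iris_dataframe.shape)
print(iris_dataframe['species'].value_counts())
```

A quick count per species like this is a cheap way to confirm that the classes are balanced before comparing them.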

The first step in data analysis is to inspect summary statistics of the data to decide whether it needs to be preprocessed to make it more consistent.

The **`describe()` method** of a `pandas` data frame gives us important statistics of the data, like the `min`, `max`, `mean`, `standard deviation`, and `quartiles`. For example, to verify the `minimum` and `maximum` values in our data, we can invoke the `describe()` method:

    # Summary of numerical variables
    print(iris_dataframe.describe())
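If only the minimum and maximum are of interest, the relevant rows can be pulled out of the summary. This is a small sketch rather than part of the original walkthrough; the names `summary` and `min_max` are illustrative, and the `class` column is dropped on the assumption that statistics of an integer label are not meaningful.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the data frame used in this article
iris_data = load_iris()
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_dataframe['class'] = iris_data.target

# describe() returns a data frame indexed by statistic name,
# so individual statistics can be selected with .loc
summary = iris_dataframe.drop(columns='class').describe()
min_max = summary.loc[['min', 'max']]
print(min_max)
```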

To count the nulls in each column, we can invoke the `isnull()` method on each column of the `pandas` data frame and sum the results.

If null values are found within a column, they can be replaced with the column mean using the `fillna()` method:

    # Count the nulls in each column
    print("Number of nulls in each column:")
    print(iris_dataframe.isnull().sum())

    # Fill nulls in a column with that column's mean.
    # Assigning the result back avoids chained in-place modification.
    column = 'sepal length (cm)'
    iris_dataframe[column] = iris_dataframe[column].fillna(iris_dataframe[column].mean())
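The Iris data loaded here has no missing values, so the fill step has nothing to do. To see it in action, the sketch below removes a few values from a copy of the data and then repairs them. The copy `demo` and the choice of rows 0 to 2 are arbitrary assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the data frame used in this article
iris_data = load_iris()
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_dataframe['class'] = iris_data.target

# Work on a copy and blank out three values (illustrative only)
column = 'sepal length (cm)'
demo = iris_dataframe.copy()
demo.loc[[0, 1, 2], column] = np.nan

# Count nulls per column before filling
null_counts = demo.isnull().sum()
print(null_counts)

# mean() skips NaN by default, so it reflects the remaining 147 values
column_mean = demo[column].mean()
demo[column] = demo[column].fillna(column_mean)

# Confirm that no nulls remain
print(demo.isnull().sum().sum())
```

Mean imputation is only one option; whether it is appropriate depends on how much data is missing and why.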

Raw statistical values are hard to interpret at a glance. Visualizations make it easier to understand the data and detect patterns.

Here, we can visualize our data using `histograms`, `box plots`, and `scatter plots`.

We will plot the frequency of `sepal width` and `sepal length` values for the flowers in our dataset. This helps us understand the underlying distribution of each feature:

    # Histograms for sepal length and sepal width
    fig = plt.figure(figsize=(10, 5))

    # Left panel: sepal length
    ax1 = fig.add_subplot(121)
    ax1.set_xlabel('sepal length (cm)')
    ax1.set_ylabel('Count')
    iris_dataframe['sepal length (cm)'].hist(ax=ax1)

    # Right panel: sepal width
    ax2 = fig.add_subplot(122)
    ax2.set_xlabel('sepal width (cm)')
    ax2.set_ylabel('Count')
    iris_dataframe['sepal width (cm)'].hist(ax=ax2)

    plt.show()

Histograms for Sepal Length and Width (cm)
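The counts behind a histogram can also be inspected as numbers. The sketch below uses `numpy.histogram` with 10 bins, which is the default bin count of the `hist()` method in `pandas`. It is an illustrative addition, not part of the original walkthrough.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the data frame used in this article
iris_data = load_iris()
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_dataframe['class'] = iris_data.target

# Compute bin counts and bin edges for sepal length
counts, edges = np.histogram(iris_dataframe['sepal length (cm)'], bins=10)

# Print each bin's range alongside the number of flowers that fall in it
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.2f} to {right:.2f}: {count}")
```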

We can look for outliers in the `sepal width` feature of our dataset and then decide whether or not to remove them:

    # Box plot of sepal width, grouped by class
    iris_dataframe.boxplot(column='sepal width (cm)', by='class')
    title_boxplot = 'sepal width (cm) by class'
    plt.title(title_boxplot)
    plt.suptitle('')  # remove the automatic super-title added by pandas
    plt.ylabel('sepal width (cm)')
    plt.show()

Box Plot for Sepal Width (cm)
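The points a box plot draws beyond its whiskers can also be listed numerically. The sketch below applies the common 1.5 × IQR rule, which matches the default whisker length in Matplotlib box plots. The helper `iqr_outliers` is a hypothetical function written for this example, and the rule is one convention among several rather than the only definition of an outlier.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the data frame used in this article
iris_data = load_iris()
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_dataframe['class'] = iris_data.target


def iqr_outliers(values):
    """Return the entries of a Series lying more than 1.5 * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]


# Collect the flagged sepal width values separately for each class
outliers = {}
for label, group in iris_dataframe.groupby('class'):
    outliers[label] = iqr_outliers(group['sepal width (cm)'])
    print(f"class {label}: {outliers[label].tolist()}")
```

Listing the flagged rows this way makes it easier to check each one before deciding whether to drop it.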

For each class of flowers within our dataset, we can judge how `petal width` and `petal length` are related to each other:

    # Scatter plot of petal length vs. petal width, colored by class
    colors = ['red' if l == 0 else 'blue' if l == 1 else 'green' for l in iris_data.target]
    plt.scatter(iris_dataframe['petal length (cm)'], iris_dataframe['petal width (cm)'], color=colors)
    plt.xlabel('petal length (cm)')
    plt.ylabel('petal width (cm)')
    plt.show()

Scatter Plot for Petal Length vs. Petal Width (cm)
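The relationship visible in a scatter plot can be summarized with a correlation coefficient. The sketch below computes the Pearson correlation between petal length and petal width, first over all records and then within each class. It is an illustrative addition; Pearson correlation only captures linear association, so it complements the plot rather than replacing it.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the data frame used in this article
iris_data = load_iris()
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_dataframe['class'] = iris_data.target

x_col = 'petal length (cm)'
y_col = 'petal width (cm)'

# Series.corr computes the Pearson correlation by default
overall = iris_dataframe[x_col].corr(iris_dataframe[y_col])
print(f"overall: {overall:.3f}")

# Correlation within each class can differ from the overall value
per_class = {}
for label, group in iris_dataframe.groupby('class'):
    per_class[label] = group[x_col].corr(group[y_col])
    print(f"class {label}: {per_class[label]:.3f}")
```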
