Solution: Clean the Data

Explore how to clean data effectively by removing missing values and managing outliers using functions like dropna and quantile in Python. Learn to apply interquartile range methods to maintain data quality, enabling more accurate predictive analysis.

Solution

Python 3.5
import pandas as pd
def clean_data(df):
    df = df.dropna()  # drop every row that contains a null value
    # Columns whose outliers need to be capped
    out_list = ['median_house_value', 'median_income', 'housing_median_age']
    # Compute the 1st and 3rd quartiles for the listed (numeric) columns only
    quantiles_df = df[out_list].quantile([0.25, 0.75])
    for out in out_list:  # traverse the list of columns
        Q1 = quantiles_df.loc[0.25, out]  # value of the 1st quartile
        Q3 = quantiles_df.loc[0.75, out]  # value of the 3rd quartile
        iqr = Q3 - Q1  # interquartile range
        lower_bound = Q1 - (iqr * 1.5)  # lower bound
        upper_bound = Q3 + (iqr * 1.5)  # upper bound
        # Cap outliers at the bounds and assign the column back to df,
        # instead of writing through a chained reference to the column
        df[out] = df[out].clip(lower=lower_bound, upper=upper_bound)
    return df

# Test Code
df = pd.read_csv('housing.csv')
df_res = clean_data(df.copy())
print(df_res)
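
To see what the two steps do on a small scale, here is a minimal sketch on made-up data. The `toy` DataFrame and its values are assumptions for illustration only and do not come from housing.csv. It drops the row containing NaN, computes the interquartile-range bounds for one column, and caps the single outlier at the upper bound.

```python
import pandas as pd

# Hypothetical toy data: one missing value and one obvious outlier (100.0)
toy = pd.DataFrame({"median_income": [2.0, 3.0, None, 4.0, 5.0, 100.0]})

toy = toy.dropna()  # removes the row holding NaN, leaving 5 rows

q1 = toy["median_income"].quantile(0.25)  # 1st quartile
q3 = toy["median_income"].quantile(0.75)  # 3rd quartile
iqr = q3 - q1                             # interquartile range
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Values outside [lower_bound, upper_bound] are replaced by the nearest bound
capped = toy["median_income"].clip(lower=lower_bound, upper=upper_bound)

print(lower_bound, upper_bound)  # 0.0 8.0
print(capped.tolist())           # [2.0, 3.0, 4.0, 5.0, 8.0]
```

Note that the outlier is capped rather than dropped, so the row count after dropna() stays the same.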

Explanation

A function clean_data is declared with df passed to it as a parameter.

On line 3, the dropna() function of the DataFrame, which automatically finds and removes all NaN containing ...