Solution: Clean the Data

Explore how to clean data effectively by removing missing values and managing outliers using functions like dropna and quantile in Python. Learn to apply interquartile range methods to maintain data quality, enabling more accurate predictive analysis.

We'll cover the following...

- Solution
- Explanation

Python 3.5

import pandas as pd

def clean_data(df):
    df = df.dropna().copy() # dropping all rows with null values (copy so edits stay local)
    # A list of all columns on which outliers need to be removed
    out_list = ['median_house_value', 'median_income', 'housing_median_age']
    # computing 1st & 3rd quartiles for the listed numeric columns only
    quantiles_df = df[out_list].quantile([0.25, 0.75])
    for out in out_list: # traversing through the list
        Q1 = quantiles_df.loc[0.25, out] # Retrieving value of 1st quartile
        Q3 = quantiles_df.loc[0.75, out] # Retrieving value of 3rd quartile
        iqr = Q3 - Q1 # computing the interquartile range
        lower_bound = Q1 - (iqr * 1.5) # computing lower bound
        upper_bound = Q3 + (iqr * 1.5) # computing upper bound
        # Cap values below/above the bounds at the bounds themselves.
        # Assigning back to df[out] avoids chained assignment on a column reference.
        df[out] = df[out].clip(lower=lower_bound, upper=upper_bound)
    return df

# Test Code
df = pd.read_csv('housing.csv')
df_res = clean_data(df.copy())
print(df_res)
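
To see what the interquartile range bounds do on a small input, the sketch below applies the same steps (drop missing values, compute the quartiles, cap at the bounds) to a made-up Series. The toy values and the helper name `cap_outliers_iqr` are illustrative assumptions and are not part of the lesson's dataset or solution.

```python
import pandas as pd

def cap_outliers_iqr(series, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at those bounds."""
    q1 = series.quantile(0.25)  # 1st quartile
    q3 = series.quantile(0.75)  # 3rd quartile
    iqr = q3 - q1               # interquartile range
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical toy data: one missing value and one extreme value
toy = pd.Series([1.0, 2.0, 3.0, 4.0, None, 100.0])
cleaned = cap_outliers_iqr(toy.dropna())
print(cleaned.tolist())  # [1.0, 2.0, 3.0, 4.0, 7.0]
```

Here Q1 is 2 and Q3 is 4, so the IQR is 2 and the bounds are -1 and 7. The value 100 lies above the upper bound and is replaced by 7, while the remaining values are unchanged.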
