Trusted answers to developer questions

Related Tags

data sources
clean data
python

How to find and clean data sources

Ayaz Gillani

Data cleaning using Pandas

To work efficiently, we need data that is free of errors and corruption. The pandas library provides the tools we need to clean data. To start using pandas, we first import it:

import pandas as pd

The next step is to load the .csv file into a DataFrame:

data = pd.read_csv('./filename.csv')
# Import the pandas module
import pandas as pd
# Load the dataset by reading the CSV file
data = pd.read_csv('./data.csv')
# Display the first five rows of the dataset
data.head()
Reading and printing data file
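
As a self-contained variation on the snippet above, the sketch below reads CSV text from memory instead of a file, so it runs without './data.csv' on disk. The column names and values are made up for illustration and are not from the original dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for './data.csv'.
# Empty fields are parsed as NaN (missing values).
csv_text = """name,age,city
Alice,30,Lahore
Bob,,Karachi
Carol,25,
"""

# read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())   # first five rows (here, all three)
print(data.shape)    # (number of rows, number of columns)
print(data.dtypes)   # 'age' is float64 because it contains a NaN
```

Checking shape and dtypes right after loading is a quick way to spot columns whose type changed because of missing entries.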

The following five expressions help locate missing data in the dataset:

data.isnull()
data.isna()
data.isna().any()
data.isna().sum()
data.isna().any().sum()
  • data.isnull() function: It returns a DataFrame of the same shape as the dataset, with True wherever a value is null and False everywhere else.

  • data.isna() function: It behaves exactly like isnull(); in pandas, isnull() is an alias for isna().

  • data.isna().any() function: It returns one boolean value per column, which is True if that column contains at least one null value.

  • data.isna().sum() function: It returns the number of null values in each column.

  • data.isna().any().sum() function: It returns a single number, the count of columns that contain at least one null value.
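
The five expressions above can be compared side by side on a small DataFrame. This is a minimal sketch; the column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with two missing values.
data = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [30, np.nan, 25],
    "city": ["Lahore", "Karachi", None],
    "score": [1.0, 2.0, 3.0],
})

mask = data.isna()                            # boolean DataFrame, same shape as data
same_as_isnull = data.isnull().equals(mask)   # isnull() is an alias of isna()
per_column_flag = data.isna().any()           # one boolean per column
per_column_count = data.isna().sum()          # number of nulls per column
columns_with_nulls = data.isna().any().sum()  # how many columns contain a null
total_nulls = data.isna().sum().sum()         # total nulls in the whole dataset

print(per_column_flag)
print(per_column_count)
print(columns_with_nulls, total_nulls)
```

The last line, data.isna().sum().sum(), is not one of the five expressions listed above; it is included because the total number of missing cells is often what we want to know first.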

fillna() function

After we locate the null or NaN values in our dataset, the next step is to fill those places with other values. For this purpose, we can use the fillna() function of a DataFrame:

DataFrame_name.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)

This function replaces NA/NaN values with the value we specify (for example, 0) or with neighboring valid values, depending on the fill method.
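
A minimal sketch of filling with a value, assuming a small made-up DataFrame: a scalar fills every hole with the same value, while a dict chooses a different fill value per column.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing value in each column.
data = pd.DataFrame({
    "age": [30.0, np.nan, 25.0],
    "city": ["Lahore", "Karachi", None],
})

# Scalar: every NaN in the selected numeric column becomes 0.
filled_zero = data[["age"]].fillna(0)

# Dict: a different replacement for each column.
filled_by_column = data.fillna({"age": data["age"].mean(), "city": "unknown"})

print(filled_zero)
print(filled_by_column)
print(data)  # unchanged, because fillna returns a new DataFrame by default
```

Filling a numeric column with its mean is only one possible choice here; the right replacement depends on what the column represents.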

Parameters

Let’s discuss the arguments that the fillna() function accepts:

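  • value: The value used to fill holes (places with null or NaN), for example, 0. It can be a scalar, or a dict, Series, or DataFrame that specifies a different value for each index or column. It cannot be a list.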

  • method: {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’}, default None. The method used to fill holes in a reindexed series. Recent pandas versions deprecate this argument in favor of the ffill() and bfill() methods:

    • pad / ffill: Propagate the last valid observation forward to the next valid one.
    • backfill / bfill: Use the next valid observation to fill the gap.

  • axis: {0 or ‘index’, 1 or ‘columns’}. The axis along which to fill missing values.

  • inplace: bool, default False. If True, the DataFrame is modified in place: no filled copy is created, and the original values are overwritten.

  • limit: int, default None. This is the maximum number of consecutive NaN values to forward or backward fill if the method is specified.

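  • downcast: dict, default None. A dict of item -> dtype describing what to downcast if possible, or the string ‘infer’, which tries to downcast to an appropriate equal type (for example, float64 to int64 if possible).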
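
The fill-direction, limit, and inplace behavior described above can be sketched as follows. This example uses the ffill() and bfill() methods, which correspond to method=‘ffill’ and method=‘bfill’; the column name and values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with a gap of three consecutive missing values.
readings = pd.DataFrame({"reading": [1.0, np.nan, np.nan, np.nan, 5.0]})

# Forward fill: propagate the last valid observation forward.
forward = readings.ffill()

# Backward fill: use the next valid observation to fill the gap.
backward = readings.bfill()

# limit=1: fill at most one consecutive NaN, leave the rest of the gap empty.
forward_limited = readings.ffill(limit=1)

# inplace=True: modify the DataFrame itself instead of creating a filled copy.
overwritten = readings.copy()
overwritten.fillna(0, inplace=True)

print(forward["reading"].tolist())
print(backward["reading"].tolist())
print(forward_limited["reading"].tolist())
print(overwritten["reading"].tolist())
```

Working on a copy before using inplace=True keeps the original readings available for comparison.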

Copyright ©2022 Educative, Inc. All rights reserved