Hassaan Waqar

Collecting data can be expensive in data science, and often we can only collect a limited amount. From that data, we need to estimate quantities that describe the whole population, known as **population statistics**. These can include the mean, median, or standard deviation of the population. If large amounts of data are not available and further data collection is not possible, we can rely on a technique known as **bootstrap sampling**.

Bootstrap sampling estimates a summary statistic by averaging the estimates computed from many smaller samples drawn at random from the original data. The sampling is done with replacement: a value drawn from the original data becomes part of the smaller sample and is then returned to the original data, so it can be drawn again. Thus, a single value can appear in the smaller sample more than once.
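
To make the difference concrete, the sketch below contrasts sampling with and without replacement using Python's standard `random` module. The data values and the seed are made up for illustration.

```python
import random

random.seed(0)  # Arbitrary seed, only so the run is repeatable

data = [10, 20, 30, 40, 50]  # Hypothetical original data

# With replacement: every draw is made from the full data,
# so the same value may be picked more than once.
with_replacement = random.choices(data, k=5)

# Without replacement: each value can be picked at most once.
without_replacement = random.sample(data, k=5)

print("With replacement:   ", with_replacement)
print("Without replacement:", without_replacement)
```

Because `random.sample` draws without replacement, a call such as `random.choices` (or `numpy.random.choice` with `replace=True`) is the one that matches the bootstrap procedure.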

A single smaller sample can be made by following the steps below:

- Choose a sample size (smaller than or equal to the size of the original data).
- Pick a value at random from the original data.
- Add the value to the smaller sample.
- Return the value to the original data.
- Repeat until the smaller sample reaches the chosen size.
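
The steps above can be sketched as a short loop. This is a minimal illustration, not a reference implementation: the function name `draw_bootstrap_sample`, the example data, and the seed are assumptions made for this example.

```python
import random


def draw_bootstrap_sample(data, sample_size, rng):
    """Build one bootstrap sample by drawing values with replacement."""
    sample = []
    while len(sample) < sample_size:
        # Pick a value at random from the original data.
        value = rng.choice(data)
        # Add it to the smaller sample. The original data is left
        # unchanged, which is equivalent to putting the value back.
        sample.append(value)
    return sample


rng = random.Random(42)  # Seeded generator, only for repeatability
original = [3, 7, 8, 12, 15, 21]  # Hypothetical original data
print(draw_bootstrap_sample(original, sample_size=4, rng=rng))
```

Since the original list is never modified, every draw sees all of the values, which is what allows the same value to be picked again.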

The sample size can be as large as the original data, which is the usual choice in the standard bootstrap. When resampling at full size is too computationally expensive, a smaller size, such as 50% to 80% of the original data, is sometimes used.

The illustration below shows how we can bootstrap a single sample:

Since we form smaller samples to estimate a statistic such as the mean or median, we need to calculate that statistic for each smaller sample. We first choose how many bootstrap samples to form, then calculate the statistic for each one. The final estimate is the average of the statistics obtained from all the smaller samples.

We can summarize the process as follows:

- Choose the number of bootstrap samples to form.
- Choose the size of each sample.
- Randomly choose values from the original data, with replacement, and add them to the smaller sample.
- Once the smaller sample is formed, calculate the required statistic for it.
- Repeat the process for each bootstrap sample.
- Take the average of the statistics calculated from all bootstrap samples.
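
The whole procedure can be sketched as one function that accepts any statistic. The names `bootstrap_estimate`, `n_samples`, and `sample_size`, along with the synthetic data, are assumptions for this illustration. The median is used here only to show that the same steps apply to statistics other than the mean.

```python
import numpy as np


def bootstrap_estimate(data, statistic, n_samples, sample_size, seed=None):
    """Average a statistic over several bootstrap samples."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_samples):
        # Draw one bootstrap sample with replacement.
        sample = rng.choice(data, size=sample_size, replace=True)
        # Compute the statistic on that sample.
        estimates.append(statistic(sample))
    # The final estimate is the average of the per-sample statistics.
    return float(np.mean(estimates))


# Synthetic data for illustration: 500 values centered around 50.
data = np.random.default_rng(0).normal(loc=50.0, scale=5.0, size=500)

estimate = bootstrap_estimate(data, np.median, n_samples=200, sample_size=300, seed=1)
print("Median of data:     ", np.median(data))
print("Bootstrapped median:", estimate)
```

Passing the statistic as a function keeps the resampling logic in one place, so switching from the median to the mean or standard deviation only changes one argument.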

Bootstrapping is a simple technique for estimating population statistics. It has the following advantages:

- Averaging over many randomly drawn samples tends to give an estimate close to the actual population statistic.
- The cost of further data collection is avoided.
- It is a simple and straightforward way to obtain estimates of population statistics.

The code snippet below shows a simple example of bootstrap sampling in Python:

    import numpy as np
    import random

    # Create a normal random sample of size 1000 centered around 300
    x = np.random.normal(loc=300.0, size=1000)

    print("Actual Mean:", np.mean(x))  # Mean of original sample for comparison later

    sample_mean = []  # To store the mean of each smaller sample

    for i in range(50):  # Create 50 bootstrap samples
        # Randomly take 30 values with replacement
        # (random.choices draws with replacement; random.sample does not)
        y = random.choices(x.tolist(), k=30)
        avg = np.mean(y)  # Find the mean of the smaller sample
        sample_mean.append(avg)  # Add the mean to the list

    # Take the average of all statistics collected from the smaller samples
    print("Bootstrapped Mean:", np.mean(sample_mean))

The code above shows that bootstrap sampling produces a mean close to the mean of the original data.

- We create an original array of 1000 normally distributed values centered around 300.
- We create 50 bootstrap samples, each containing 30 values.
- The mean of each sample is calculated and added to the list.
- Finally, the average of all the means is taken.


Copyright ©2022 Educative, Inc. All rights reserved
