Trusted answers to developer questions

Related Tags

data science

What are sampling techniques in data science?

Hassaan Waqar

Data scientists and researchers need to collect data to run tests, analyze scenarios, and test hypotheses. Ideally, they would obtain data from the entire population of the subject in question. In practice, this is rarely feasible: limited resources mean data scientists must rely on samples drawn from that population.

Data samples are derived from the population that is being studied. The aim is to obtain samples that can represent the population so that the findings applicable to the sample can be generalized to the population.

The illustration below shows the difference between population and sample:

Population and sample

Sampling techniques

There are several ways data can be sampled from a target population. Sampling techniques can be divided into two broad categories:

Probability sampling: Elements are selected at random, so every element of the population has a known, non-zero chance of being part of the sample. In the simplest case, every element has an equal chance. Probability samples tend to be more representative of the population.

Non-probability sampling: Elements are not selected at random, so some elements are more likely to be selected than others, and some may have no chance at all. This method of sampling might not always represent the population as a whole.

Probability sampling techniques

We will now discuss techniques that fall under the category of probability sampling:

Simple random sampling

Simple random sampling (SRS) is one of the simplest methods of sampling: subjects are selected purely at random, and each element has an equal chance of getting selected. Sampling is usually done by assigning a number to each element of the population and carrying out a lucky draw.

In the illustration on the right, each individual has a chance of $\frac{1}{15}$ of getting selected.

Simple Random Sampling
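The lucky draw described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the helper name `simple_random_sample`, the seed, and the sample size of 5 are assumptions made for the example, and the population of 15 numbered individuals mirrors the illustration.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k distinct elements; every element is equally likely to be picked."""
    rng = random.Random(seed)  # local generator so the draw can be reproduced
    return rng.sample(population, k)  # sampling without replacement

# Hypothetical population: 15 individuals, each assigned a number from 1 to 15
population = list(range(1, 16))

sample = simple_random_sample(population, k=5, seed=42)
print(sample)
```

With `k=1`, each individual has a chance of 1/15 of being drawn. With `k=5` drawn without replacement, each individual has a 5/15 chance of ending up in the sample.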

Stratified sampling

In stratified sampling, elements are first sub-grouped based on common characteristics such as gender, age, income level, or profession. These subgroups are known as strata. Elements are then randomly sampled from each stratum. This method ensures that the sampled data has representation from all subgroups.

The illustration on the right creates strata based on profession and then samples from each of them.

Stratified Sampling

Elements within a stratum are homogeneous.

The strata do not need to contain an equal number of elements.

Each element within a stratum has an equal probability of being selected.
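The grouping and per-group draw can be sketched as follows. This is one possible approach under stated assumptions: the helper `stratified_sample`, the made-up list of people, and the choice to take the same fraction from every subgroup (proportional allocation) are all hypothetical and not prescribed by the text above.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=None):
    """Group records by key, then randomly sample a fraction (0 < fraction <= 1) of each group."""
    rng = random.Random(seed)

    # Step 1: build the subgroups from a shared characteristic
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)

    # Step 2: simple random sampling inside every subgroup
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))  # at least one element per subgroup
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population grouped by profession; group sizes are deliberately unequal
people = (
    [(f"engineer_{i}", "engineer") for i in range(6)]
    + [(f"doctor_{i}", "doctor") for i in range(4)]
    + [(f"teacher_{i}", "teacher") for i in range(2)]
)

sample = stratified_sample(people, key=lambda p: p[1], fraction=0.5, seed=0)
print(sample)
```

Here the sample contains 3 engineers, 2 doctors, and 1 teacher, so every profession is represented in proportion to its size.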

Cluster sampling

In cluster sampling, we divide our target population into subgroups known as clusters and then choose a cluster at random. Each cluster has an equal chance of getting selected.

The illustration on the right shows each cluster having a chance of $\frac{1}{4}$ of being selected.

Cluster Sampling

Elements within clusters are heterogeneous.
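A minimal sketch of cluster sampling, assuming the population has already been divided into four named clusters as in the illustration. The helper `cluster_sample`, the cluster names, and the member labels are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters=1, seed=None):
    """Pick whole clusters at random; every cluster is equally likely to be chosen."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # draw cluster names, not elements
    return {name: clusters[name] for name in chosen}   # keep every member of a chosen cluster

# Hypothetical population already divided into four clusters
clusters = {
    "A": ["a1", "a2", "a3"],
    "B": ["b1", "b2", "b3", "b4"],
    "C": ["c1", "c2"],
    "D": ["d1", "d2", "d3"],
}

picked = cluster_sample(clusters, n_clusters=1, seed=7)
print(picked)
```

Unlike stratified sampling, the random draw happens at the level of the clusters: with four clusters and one draw, each cluster has a 1/4 chance, and all elements of the chosen cluster enter the sample.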

Non-probability sampling techniques

We will now discuss techniques that fall under the category of non-probability sampling:

Convenience sampling

In convenience sampling, samples are selected based on availability and convenience. For example, subjects might be chosen on a first-come, first-served basis or because they are willing to take part in a survey.

The illustration on the right chooses the first three individuals from each line.

Convenience samples may not be representative of the population, since they are subject to biases related to gender, race, age, religion, etc.

Convenience Sampling
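The selection in the illustration, taking the first three individuals from each line, can be sketched as below. The helper `convenience_sample` and the contents of the lines are hypothetical.

```python
def convenience_sample(lines, per_line=3):
    """Take whoever is at the front of each line; no randomness is involved."""
    return [person for line in lines for person in line[:per_line]]

# Hypothetical queues of people, in arrival order
lines = [
    ["a1", "a2", "a3", "a4", "a5"],
    ["b1", "b2", "b3", "b4"],
    ["c1", "c2"],
]

sample = convenience_sample(lines)
print(sample)  # ['a1', 'a2', 'a3', 'b1', 'b2', 'b3', 'c1', 'c2']
```

Anyone standing further back in a line can never be selected, which is why this kind of sample can be biased.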

Quota sampling

Quota sampling involves selecting elements according to some predetermined rule until the required number of elements (the quota) has been filled. The rule can be, for example, selecting multiples of a number or taking every fifth person to sign up.

The illustration on the right shows balls that are multiples of two being selected.

Like convenience samples, quota samples may not be representative of the population.

Quota Sampling
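The rule from the illustration, selecting balls whose number is a multiple of two, can be sketched as follows. The helper `quota_sample`, the twelve numbered balls, and the quota of four are assumptions made for the example.

```python
def quota_sample(elements, rule, quota):
    """Keep elements that satisfy a fixed rule, stopping once the quota is filled."""
    selected = []
    for element in elements:
        if len(selected) >= quota:
            break  # quota filled; later elements are never considered
        if rule(element):
            selected.append(element)
    return selected

# Hypothetical population: balls numbered 1 to 12
balls = list(range(1, 13))

sample = quota_sample(balls, rule=lambda n: n % 2 == 0, quota=4)
print(sample)  # [2, 4, 6, 8]
```

The selection is fully determined by the rule and the order of the elements, so elements that fail the rule have no chance of being selected.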

Copyright ©2022 Educative, Inc. All rights reserved