What are sampling techniques in data science?

Data scientists and researchers need to collect data for running tests, analyzing scenarios, and testing hypotheses. Ideally, they would obtain data from the entire population under study. In practice, this is rarely feasible: limited resources mean data scientists must rely on samples drawn from that population.

Data samples are drawn from the population being studied. The aim is to obtain samples that represent the population well, so that findings from the sample can be generalized to the population.
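To make the idea of generalizing from a sample concrete, here is a minimal sketch in Python. The population is synthetic (made-up "heights" generated for illustration, not real data), and the names `population` and `sample` are assumptions introduced for this example. It compares the mean of a random sample with the mean of the full population.

```python
import random
import statistics

# Hypothetical population: 10,000 synthetic "heights" in cm (made-up data).
rng = random.Random(42)  # fixed seed so the sketch is reproducible
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Draw a sample of 500 elements at random, without replacement.
sample = rng.sample(population, 500)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

With a randomly drawn sample of this size, the two means should typically be close, which is what allows a finding about the sample to be carried over to the population.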

The illustration below shows the difference between population and sample:

Sampling techniques

There are several ways data can be sampled from a target population. Sampling techniques can be divided into two broad categories:

Probability sampling: Elements are selected at random, and every element of the population has a known, nonzero chance of being included in the sample. In simple random sampling, that chance is equal for every element. Probability samples tend to be more representative of the population.

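Non-probability sampling: Elements are selected by non-random criteria, such as convenience or the researcher's judgment, so selection probabilities are unknown and some elements may have no chance of being selected. Samples drawn this way might not represent the population as a whole.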
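The contrast between the two categories can be sketched with Python's standard library. This is an illustrative example under stated assumptions: the population of customer IDs is hypothetical, and taking the first 50 records is just one simple stand-in for a non-probability (convenience) sample.

```python
import random

# Hypothetical population: 1,000 customer IDs, ordered by sign-up date.
population = list(range(1, 1001))
rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Probability sampling (simple random sample): every ID has the same
# known chance, 50/1000, of ending up in the sample.
random_sample = rng.sample(population, 50)

# Non-probability sampling (convenience sample): take whatever is easiest
# to reach, here the first 50 IDs. Later IDs have no chance of selection.
convenience_sample = population[:50]

print("largest ID in random sample:     ", max(random_sample))
print("largest ID in convenience sample:", max(convenience_sample))
```

The convenience sample only ever contains the earliest sign-ups, so any conclusion drawn from it may not hold for the customers who joined later.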

Probability sampling techniques

We will now discuss techniques that fall under the category of probability sampling:

Simple random sampling

Stratified sampling

Cluster sampling

Non-probability sampling techniques

Convenience sampling

Quota sampling