Statistics

The questions and answers in this lesson will help you understand the types of statistics questions you can expect in data science interviews.

What is the difference between overfitting and underfitting?

Most people think of model fitting as a machine-learning concept, but it is actually an old statistical concept that machine learning makes heavy use of. To build a model, we divide the dataset into two parts: a training set and a test set. The test set is sometimes called "new" or "unseen" data, so don't be confused by the terms. We fit the model on the training set and evaluate it on the test set.

The main difference is that an overfit model achieves high accuracy on the training set but performs poorly on the test set, because it has learned noise specific to the training data. An underfit model achieves low accuracy even on the training set, so it performs poorly on the test set as well.
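To make this concrete, here is a minimal sketch using scikit-learn; the dataset and the polynomial degrees are arbitrary illustrative choices, not part of the original answer.

```python
# Compare an underfit, a reasonable, and an overfit polynomial model
# by their training vs. test R-square scores on a noisy sine curve.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          round(model.score(X_train, y_train), 3),  # training accuracy (R-square)
          round(model.score(X_test, y_test), 3))    # test accuracy (R-square)
```

The underfit model scores poorly on both sets, while the overfit model scores noticeably better on the training set than on the test set.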

Which would you use to select a linear regression model: R-square or adjusted R-square?

R-square and adjusted R-square are two accuracy measures for a linear regression model. R-square never decreases when we add more independent variables to the model, even if the new variables add no real explanatory power. Adjusted R-square, on the other hand, increases only if the newly added variable improves the model's accuracy by more than the penalty for the extra parameter. Therefore, we choose adjusted R-square when selecting a model.
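As a rough sketch on synthetic data (with adjusted R-square computed by hand from the usual formula 1 − (1 − R²)(n − 1)/(n − p − 1)), adding a pure-noise predictor typically nudges R-square up while adjusted R-square goes down:

```python
# R-square never decreases when a noise variable is added,
# but adjusted R-square penalizes the extra parameter.
import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_r2(r2, n, p):
    # Adjusted R-square = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
noise = rng.normal(size=n)                 # unrelated to y
y = 2 * x1 + rng.normal(scale=0.5, size=n)

for X in (x1.reshape(-1, 1), np.column_stack([x1, noise])):
    p = X.shape[1]
    r2 = LinearRegression().fit(X, y).score(X, y)
    print(p, round(r2, 4), round(adjusted_r2(r2, n, p), 4))
```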

What are type I and type II errors?

A type I error is the probability of rejecting the null hypothesis when it is true. For example, consider the chance of not trusting an honest person: the null hypothesis is that the person is honest, the alternative hypothesis is that the person is lying, and we reject the null hypothesis even though it is true. A type II error is the probability of failing to reject the null hypothesis when it is false. For example, consider the chance of trusting a lying person: the null hypothesis is that the person is honest, the alternative hypothesis is that the person is lying, and we fail to reject the null hypothesis even though it is false.
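If a concrete illustration helps, here is a simulation sketch (the test, sample size, and effect size are arbitrary choices) that estimates both error rates for a one-sample t-test at a 5% significance level:

```python
# Estimate type I and type II error rates by repeated simulation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, n, trials = 0.05, 30, 5000

# Type I error: the null (mean = 0) is true, but we reject it.
type1 = np.mean([stats.ttest_1samp(rng.normal(0.0, 1, n), 0).pvalue < alpha
                 for _ in range(trials)])

# Type II error: the null is false (true mean = 0.3), but we fail to reject it.
type2 = np.mean([stats.ttest_1samp(rng.normal(0.3, 1, n), 0).pvalue >= alpha
                 for _ in range(trials)])

print(round(type1, 3), round(type2, 3))  # roughly 0.05, and the test's beta
```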

Explain why a continuity correction is needed when a continuous random variable approximates a discrete random variable to calculate a probability for the discrete random variable.

A continuity correction is needed because a discrete random variable takes only integer values, whereas a continuous random variable takes any real value. Adjusting by 0.5 around the integer of interest lets the continuous distribution cover the full width of that discrete value, which makes the approximated probabilities more accurate when the sample size is large. For example, the normal distribution is a continuous distribution that can take any value within an interval and is commonly used to approximate the binomial distribution.
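As a small sketch using SciPy (the binomial parameters are arbitrary), approximating P(X ≤ 55) for X ~ Binomial(100, 0.5) with a normal distribution is noticeably more accurate once 0.5 is added to cover the full width of the integer 55:

```python
# Normal approximation to a binomial probability, with and without
# the continuity correction.
from scipy import stats

n, p, k = 100, 0.5, 55
mu = n * p
sigma = (n * p * (1 - p)) ** 0.5

exact = stats.binom.cdf(k, n, p)                       # exact binomial P(X <= k)
no_correction = stats.norm.cdf(k, mu, sigma)           # normal approximation
with_correction = stats.norm.cdf(k + 0.5, mu, sigma)   # continuity-corrected

print(round(exact, 4), round(no_correction, 4), round(with_correction, 4))
```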

What is a p-value?

The definition of the p-value is one of the most common questions in data science interviews, and many interviewers complain that candidates don't know what a p-value is. Many people know how to use a p-value but don't understand what it means. The p-value is the probability of obtaining a sample statistic at least as extreme as the one computed from our sample, assuming the null hypothesis is true.

If this probability is greater than the significance level (for example, 5%), we do not reject the null hypothesis, because the sample statistic is likely to occur under the null hypothesis; hence, it is compatible with the null hypothesis. If this probability is less than the significance level, we reject the null hypothesis, because the sample statistic is unlikely to occur under the null hypothesis; hence, it is not compatible with the null hypothesis.
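A minimal sketch of this decision rule, assuming SciPy and a made-up sample (testing whether the population mean is 50):

```python
# One-sample t-test: compare the p-value against the significance level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(loc=52, scale=5, size=40)  # hypothetical sample

t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
alpha = 0.05

if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null hypothesis")
```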

Assume you have a dataset in which some variables are highly correlated with each other, and you need to run a principal component analysis on this data. Would you remove the correlated variables? If yes, why, and if no, why not?

We should remove correlated variables first, because if we include them, the variance captured by the components built from those correlated variables will be inflated. PCA will also give more importance to the correlated variables, so the resulting components can be misleading.
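A quick way to see this effect (synthetic data; the near-duplicate variable is an illustrative extreme case):

```python
# With two highly correlated predictors, the first principal component
# is dominated by that pair and absorbs most of the variance.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n = 500
a = rng.normal(size=n)
b = a + rng.normal(scale=0.05, size=n)   # almost a copy of a
c = rng.normal(size=n)                   # independent variable
X = StandardScaler().fit_transform(np.column_stack([a, b, c]))

pca = PCA().fit(X)
print(np.round(pca.explained_variance_ratio_, 3))  # first component dominates
print(np.round(pca.components_[0], 3))             # loads mostly on a and b
```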

What do interpolation and extrapolation mean in regression modeling?

Interpolation is estimating a new value of the dependent variable for values of the independent variables that lie within the range observed in the dataset. Extrapolation is estimating a new value of the dependent variable for values of the independent variables that lie outside the range observed in the dataset.
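A small sketch (the data is invented): fit a line on x values between 1 and 10, then predict inside and outside that range.

```python
# Interpolation vs. extrapolation with a fitted regression line.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
x = np.arange(1, 11).reshape(-1, 1)                 # observed x values: 1..10
y = 3 * x.ravel() + rng.normal(scale=1.0, size=10)

model = LinearRegression().fit(x, y)

print(model.predict([[5.5]]))   # interpolation: 5.5 lies within [1, 10]
print(model.predict([[25.0]]))  # extrapolation: 25 lies outside the observed range
```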

What are confounding variables?

Confounding variables are variables that are correlated with both the dependent variable and the independent variables but are not included in the model. For example, if we want to study the effect of weight gain on blood pressure, fat intake can be a confounding variable because it is related to both weight gain and blood pressure.

What is AIC?

AIC (Akaike information criterion) is a measure of goodness of fit for a logistic model. It penalizes the model for having more variables, so the model with the minimum AIC is preferred: the best fit with the fewest variables. It is often described as an analogue of adjusted R² for logistic regression.
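As a brief sketch (assuming statsmodels and synthetic data), AIC = 2k − 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood; adding an irrelevant variable typically increases AIC:

```python
# Compare AIC for a logistic model with and without an irrelevant variable.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                               # unrelated to y
y = (rng.random(n) < 1 / (1 + np.exp(-x1))).astype(int)

for cols in ([x1], [x1, x2]):
    X = sm.add_constant(np.column_stack(cols))
    result = sm.Logit(y, X).fit(disp=0)
    print(X.shape[1], round(result.aic, 2))           # lower AIC is preferred
```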

What is the use of orthogonal rotation in principal component analysis?

Orthogonal rotation maximizes the difference between the variance captured by the components so that the components explain the overall variance of the data more clearly. The objective of PCA is to obtain a smaller number of components than independent variables, which reduces the dimensionality of the dataset and makes analysis easier. If the components are not rotated, the effect of PCA is diluted, and we need more components to explain the variance in the dataset, which defeats our objective. Therefore, we should rotate the components.
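As an illustrative sketch only (varimax is the most common orthogonal rotation; the implementation below is a standard textbook version applied to PCA loadings, not a library call):

```python
# Apply a varimax (orthogonal) rotation to PCA loadings so each component
# loads strongly on a small set of variables.
import numpy as np
from sklearn.decomposition import PCA

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    # Maximize the variance of the squared loadings within each component.
    p, k = loadings.shape
    rotation = np.eye(k)
    total = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3
                          - (gamma / p) * rotated @ np.diag((rotated ** 2).sum(axis=0))))
        rotation = u @ vt
        if total != 0.0 and s.sum() < total * (1 + tol):
            break
        total = s.sum()
    return loadings @ rotation

# Six observed variables driven by two latent factors (synthetic data).
rng = np.random.default_rng(7)
f1, f2 = rng.normal(size=(2, 300))
X = np.column_stack([f + 0.1 * rng.normal(size=300)
                     for f in (f1, f1, f1, f2, f2, f2)])

loadings = PCA(n_components=2).fit(X).components_.T   # variables x components
print(np.round(varimax(loadings), 2))                 # simpler, clearer loadings
```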
