Machine learning (ML) is the future of our world. In years to come, nearly every product will include ML components. The ML market is projected to grow from $7.3B in 2020 to $30.6B in 2024, and the demand for ML skills is pervasive across the industry.
The machine learning interview is a rigorous process where candidates are assessed both for their knowledge of basic concepts and for understanding of ML systems, real-world applications, and product-specific demands.
If you are looking for a career in machine learning, it is crucial to understand what is expected in the interview. So, to help you prepare, I have collected the top 40 machine learning interview questions. We will begin with some of the basics and then move to advanced questions.
Today we will go over:
Machine learning interview questions are an integral part of becoming a data scientist, machine learning engineer, or data engineer. Depending on the company, the job description title for a Machine Learning engineer may differ. You can expect to see titles like Machine Learning Engineer, Data Scientist, AI Engineer, and more.
Companies hiring for machine learning roles conduct interviews to assess individual abilities in various areas. ML interview questions tend to fall into one of these four categories.
ML interview questions now focus heavily on system design. In the ML system design interview portion, candidates are given open-ended ML problems and are expected to build an end-to-end machine learning system. Common examples are recommendation systems, visual understanding systems, and search-ranking systems.
The Google ML interview, commonly called the Machine Learning Engineer interview, emphasizes skills in Algorithms, Machine Learning, and Python.
Some common questions include gradient descent, regularization/normalization methods, and embeddings.
The interview process will be generic rather than focused on one particular team or project. Once you pass the interview, they will assign you to a team that fits your skill set.
The Amazon ML interview, called the Machine Learning Engineer Interview, focuses heavily on e-commerce ML tools, cloud computing, and AI recommendation systems.
Amazon ML engineers are expected to build ML systems and use Deep Learning models. Data scientists bridge data-driven gaps between the technical and business sides. Research scientists have higher levels of education and work to improve ASR, NLU, and TTS features.
The technical portion of the ML interview focuses on ML models, bias-variance tradeoff, and overfitting.
The Facebook ML Interview consists of generic algorithm questions, ML design, and system design. You’ll be expected to work with newsfeed ranking algorithms and local search rankings. Facebook looks for engineers who understand components of an end-to-end ML system, including deployment.
Some common interview titles you may encounter are Research Scientist, Data Science Interview, or Machine Learning Engineer. Like Amazon, they differ slightly in their focus and demand for generalist knowledge.
Data science roles at Twitter include both data scientist and research scientist positions, each tailored to different teams.
The technical portion of interviews tests your application and intuition for ML theory (including SQL and Python). Twitter looks for knowledge of statistics, experimental models, product intuition, and system design.
Now let’s dive into the top 40 questions for an ML interview. These questions are broken into beginner, intermediate, advanced, and product specific questions.
Bias (how well a model fits the data) refers to errors caused by inaccurate or overly simplistic assumptions in your ML algorithm, which lead to underfitting.
Variance (how much a model changes based on its inputs) refers to errors caused by excess complexity in your ML algorithm, which makes it overly sensitive to small variations in the training data and leads to overfitting.
In other words, simple models are stable (low variance) but highly biased. Complex models are prone to overfitting but express the truth of the model (low bias). The optimal reduction of error requires a tradeoff of bias and variance to avoid both high variance and high bias.
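To make the tradeoff concrete, here is a small numpy sketch (the toy data and the polynomial degrees are arbitrary choices, not from the article): an overly simple fit shows high bias, while an overly flexible fit shows high variance and tends to overfit.

```python
# Toy illustration of the bias-variance tradeoff with polynomial fits.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)   # noisy training data
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)                               # noise-free "truth"

for degree in (1, 3, 12):
    coeffs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # degree 1 underfits (high bias), degree 12 overfits (high variance),
    # degree 3 usually strikes the best balance on this data.
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```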
Supervised learning requires labeled training data. In other words, supervised learning uses a ground truth, meaning we have existing knowledge of our outputs and samples. The goal here is to learn a function that approximates the relationship between inputs and outputs.
Unsupervised learning, on the other hand, does not use labeled outputs. The goal here is to infer the natural structure in a dataset.
Supervised learning algorithms:
Examples of unsupervised algorithms:
The main difference is that KNN requires labeled points (classification algorithm, supervised learning), but k-means does not (clustering algorithm, unsupervised learning).
To use k-nearest neighbors, you use labeled data to classify a new, unlabeled point. K-means clustering takes unlabeled points and learns how to group them by repeatedly assigning each point to the nearest cluster mean and updating those means.
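A minimal scikit-learn sketch of the difference, assuming scikit-learn is available (the toy points and parameter choices are only for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 9]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])  # labels are required for KNN (supervised)

# KNN: classify a new, unlabeled point using its labeled neighbors.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print("KNN prediction:", knn.predict([[2, 1]]))

# k-means: no labels at all; groups points by iteratively updating cluster means.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("k-means assignments:", kmeans.labels_)
```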
Bayes’ Theorem is how we find a probability when we know other probabilities. In other words, it provides the posterior probability of a prior knowledge event. This theorem is a principled way of calculating conditional probabilities.
In ML, Bayes’ theorem is used in a probability framework that fits a model to a training dataset and for building classification predictive modeling problems (i.e. Naive Bayes, Bayes Optimal Classifier).
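As a quick illustration (the numbers below are invented, not from the article), here is Bayes' Theorem applied to a toy spam-filtering question:

```python
# P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam = 0.2                 # prior: 20% of emails are spam
p_word_given_spam = 0.6      # the word appears in 60% of spam emails
p_word_given_ham = 0.05      # ...and in 5% of non-spam emails

p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(f"Posterior P(spam | word) = {p_spam_given_word:.2f}")   # 0.75
```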
Naive Bayes classifiers are a collection of classification algorithms. These classifiers are a family of algorithms that share a common principle. Naive Bayes classifiers assume that the occurrence or absence of a feature does not influence the presence or absence of another feature.
In other words, we call this “naive”, as it assumes that all dataset features are equally important and independent.
Naive Bayes classifiers are used for classification. When the assumption of independence holds, they are easy to implement and yield better results than other sophisticated predictors. They are used in spam filtering, text analysis, and recommendation systems.
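A minimal sketch of a Naive Bayes spam filter with scikit-learn (the tiny corpus and labels are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win free money now", "meeting at noon", "claim your free prize", "lunch tomorrow?"]
labels = [1, 0, 1, 0]   # 1 = spam, 0 = not spam

vec = CountVectorizer()
X = vec.fit_transform(texts)          # bag-of-words features, treated as independent

clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["free money prize"])))   # expected: [1] (spam)
```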
A Type I error is a false positive (claiming something has happened when it hasn’t), and a Type II error is a false negative (claiming nothing has happened when it actually has).
A discriminative model learns the distinctions between different categories of data. A generative model learns the distribution of each category, i.e., how the data itself could have been generated. Discriminative models generally perform better on classification tasks.
Parametric models have a finite number of parameters. You only need to know the parameters of the model to make a data prediction. Common examples are as follows: linear SVMs, linear regression, and logistic regression.
Non-parametric models have an unbounded number of parameters to offer flexibility. For data predictions, you need the parameters of the model and the state of the observed data. Common examples are as follows: k-nearest neighbors, decision trees, and topic models.
An array is an ordered collection of objects. It assumes that every element has the same size, since the entire array is stored in a contiguous block of memory. The size of an array is specified at the time of declaration and cannot be changed afterward.
Search options for an array are Linear search and Binary search (if it’s sorted).
A linked list is a series of objects with pointers. Different elements are stored at different memory locations, and data items can be added or removed when desired.
The only search option for a linked list is Linear.
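A bare-bones Python contrast (illustrative only): a Python list behaves like an array with constant-time indexed access, while a linked list has to be walked node by node.

```python
class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

# Array-style access: jump straight to an index in O(1).
arr = [10, 20, 30]
print(arr[2])

# Linked list: follow pointers until the target is found (linear search, O(n)).
head = Node(10, Node(20, Node(30)))

def linear_search(head, target):
    node = head
    while node is not None:
        if node.value == target:
            return True
        node = node.next
    return False

print(linear_search(head, 30))
```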
Additional beginner questions may include:
- Which is more important: model performance or accuracy? Why?
- What’s the F1 score? How is it used?
- What is the Curse of Dimensionality?
- When should we use classification rather than regression?
- Explain Deep Learning. How does it differ from other techniques?
- Explain the difference between likelihood and probability.
These intermediate questions take the basic theories of ML from above and apply them in a more rigorous way.
A time series is not randomly distributed but has a chronological ordering. You want to use something like forward chaining so you can model based on past data before looking at future data. For example:
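A minimal forward-chaining sketch using scikit-learn's TimeSeriesSplit (the array size and number of splits are arbitrary): each test fold always comes later in time than its training fold.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)   # 10 time-ordered observations

for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "-> test:", test_idx)
```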
For a small training set, a model with high bias and low variance is better, as it is less likely to overfit. An example is Naive Bayes.
For a large training set, a model with low bias and high variance is better, as it can express more complex relationships. An example is logistic regression.
The ROC curve is a graphical representation of the performance of a classification model across all classification thresholds. It plots two parameters: the true positive rate and the false positive rate.
AUC (Area Under the ROC Curve) is, simply, the area under the ROC curve. AUC measures the two-dimensional area underneath the ROC curve from (0,0) to (1,1). It is used as a performance metric for evaluating binary classification models.
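A short scikit-learn sketch (the labels and scores are made up): roc_curve sweeps the classification threshold, and roc_auc_score summarizes the result as a single number.

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points on the ROC curve
print("AUC:", roc_auc_score(y_true, y_score))       # area under that curve
```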
Latent Dirichlet Allocation (LDA) is a common method for topic modeling. It is a generative model that represents documents as a mixture of topics, each with its own probability distribution over words.
LDA effectively projects documents from a high-dimensional word space onto a lower-dimensional topic space. This helps to avoid the curse of dimensionality.
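A hedged sketch of topic modeling with scikit-learn's LatentDirichletAllocation (the four tiny documents and the choice of two topics are purely illustrative):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the stock market and interest rates",
    "goals scored in the football match",
    "bond yields and market volatility",
    "the team won the championship game",
]

X = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Each row is a document expressed as a mixture over the 2 topics.
print(lda.transform(X))
```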
There are three methods we can use to prevent overfitting:
SQL is one of the most widely used tools for working with data in ML, so you need to demonstrate your ability to manipulate SQL databases.
Foreign keys allow you to match and join tables on the primary key of the corresponding table.
If you encounter this question, explain the basic concept, and then describe how you would set up SQL tables and query them.
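If it helps to have something concrete, here is a small sqlite3 sketch (the table and column names are invented) showing a foreign key and a join on it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         user_id INTEGER,        -- foreign key -> users.id
                         amount REAL,
                         FOREIGN KEY (user_id) REFERENCES users(id));
    INSERT INTO users  VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 9.99), (2, 1, 4.50), (3, 2, 20.00);
""")

# Join orders to users on the foreign key and aggregate per user.
query = """
    SELECT u.name, SUM(o.amount)
    FROM orders o
    JOIN users u ON o.user_id = u.id
    GROUP BY u.name
"""
for row in conn.execute(query):
    print(row)
```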
First, you would split the dataset into training and test sets. You could also use a cross-validation technique to segment the dataset. Then, you would select and implement performance metrics. For example, you could use the confusion matrix, the F1 score and accuracy.
You'll want to explain the nuances of how a model is measured based on different parameters. Interviewees who stand out take questions like these one step further.
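A brief sketch of what this could look like in code (the dataset and model are placeholders), using cross-validation with two of the metrics mentioned above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# 5-fold cross-validation, scored with accuracy and F1.
print("accuracy:", cross_val_score(model, X, y, cv=5, scoring="accuracy").mean())
print("F1:      ", cross_val_score(model, X, y, cv=5, scoring="f1").mean())
```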
You need to identify the missing or corrupted data and either drop those rows/columns or replace them with other values.
Pandas provides useful methods for doing this: isnull() and dropna() allow you to identify and drop missing or corrupted data, and the fillna() method can be used to fill invalid values with placeholders.
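A short pandas sketch (the DataFrame is a made-up example) of those three methods in action:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31], "income": [50000, 62000, np.nan]})

print(df.isnull().sum())       # count missing values per column
cleaned = df.dropna()          # drop rows that contain missing values
filled = df.fillna({"age": df["age"].median(), "income": 0})   # fill with placeholders
```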
Data pipelines enable us to take a data science model and automate or scale it. A common data pipeline tool is Apache Airflow, and Google Cloud, Azure, and AWS are used to host them.
For a question like this, you want to explain the required steps and discuss real experience you have building data pipelines.
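Before getting into host-specific steps, a minimal Airflow sketch can help anchor the discussion. The DAG below is hypothetical (the task names and functions are invented) and assumes Apache Airflow 2.x:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_data():
    # Pull raw data from a source such as a database or a cloud storage bucket.
    ...

def train_model():
    # Retrain or refresh the ML model on the freshly extracted data.
    ...

with DAG(
    dag_id="ml_training_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_data", python_callable=extract_data)
    train = PythonOperator(task_id="train_model", python_callable=train_model)

    extract >> train   # train_model only runs after extract_data succeeds
```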
The basic steps are as follows for a Google Cloud host:
If the model has low bias and high variance, we use a bagging algorithm, which divides the data set into subsets using randomized sampling. We then use those samples to generate a set of models with a single learning algorithm.
Additionally, we can use the regularization technique, in which higher model coefficients are penalized to lower the complexity overall.
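A hedged scikit-learn sketch of both ideas (the generated datasets and hyperparameters are placeholders): bagging decision trees to reduce variance, and ridge (L2) regularization to penalize large coefficients.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeClassifier

# Bagging: train many trees on bootstrap samples and average their votes.
X, y = make_classification(n_samples=500, random_state=0)
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0).fit(X, y)

# Regularization: alpha controls how strongly large coefficients are penalized.
Xr, yr = make_regression(n_samples=200, n_features=20, noise=10, random_state=0)
ridge = Ridge(alpha=1.0).fit(Xr, yr)
```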
A model parameter is a variable that is internal to the model. The value of a parameter is estimated from training data.
A hyperparameter is a variable that is external to the model. Its value cannot be estimated from the data; hyperparameters are set before training and are used to control how the model parameters are estimated.
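A tiny sketch of the distinction (values are illustrative): alpha is a hyperparameter chosen before training, while coef_ and intercept_ are parameters estimated from the data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=3, noise=5, random_state=0)

model = Ridge(alpha=0.5)               # hyperparameter: set externally, not learned
model.fit(X, y)
print(model.coef_, model.intercept_)   # parameters: estimated from the training data
```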
Choosing an ML algorithm depends on the type of data in question. Business requirements also matter when choosing an algorithm and building a model, so when answering this question, explain that you need more information.
For example, if your data shows a linear relationship, linear regression would be a good algorithm to use. If the data is made up of non-linear interactions, a bagging or boosting algorithm is a better fit. And if you're working with images, a neural network would be best.
Advantages:
Disadvantages:
The default method is the Gini Index, which measures the impurity of a particular node. Essentially, it calculates the probability that a randomly chosen sample from the node would be classified incorrectly. When all the elements in a node belong to a single class, we call the node "pure".
You could also use entropy (information gain), but the Gini Index is often preferred because it isn't computationally intensive and doesn't involve logarithm functions.
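A small helper (not from the article) that shows how Gini impurity is computed for a node's labels:

```python
from collections import Counter

def gini_impurity(labels):
    """Probability that a randomly drawn sample would be misclassified
    if labeled according to the node's class distribution."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["a", "a", "a", "a"]))   # 0.0 -> pure node
print(gini_impurity(["a", "a", "b", "b"]))   # 0.5 -> maximally impure for two classes
```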
Additional intermediate questions may include:
- What is a Box-Cox transformation?
- The water trapping problem
- Explain the advantages and disadvantages of decision trees.
- What is the exploding gradient problem when using the backpropagation technique?
- What is a confusion matrix? Why do you need it?
The data is spread across the median, so we can assume it follows a normal distribution. This means that approximately 68% of the data lies within one standard deviation of the mean, so around 32% of the data remains unaffected.
You should create a correlation matrix to identify and remove variables with a correlation above 75%. Keep in mind that our threshold here is subjective.
You could also calculate VIF (variance inflation factor) to check for the presence of multicollinearity. A VIF value of 4 or below generally suggests no serious multicollinearity, while a value of 10 or above indicates serious multicollinearity issues.
You can't simply remove the correlated variables, so you could instead use a penalized regression model or add random noise to the correlated variables, though the latter approach is less ideal.
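A hedged sketch of both checks on toy data (the 0.75 cutoff is the subjective threshold mentioned above; VIF comes from statsmodels):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": 0.9 * x1 + rng.normal(scale=0.1, size=200),   # deliberately correlated with x1
    "x3": rng.normal(size=200),
})

# Correlation check: flag absolute correlations above 0.75.
print(df.corr().abs() > 0.75)

# VIF check: values of 10 or more signal serious multicollinearity.
for i, col in enumerate(df.columns):
    print(col, variance_inflation_factor(df.values, i))
```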
XGBoost is an ensemble method that builds many trees sequentially, so each new tree improves on the errors of the ones before it.
SVM is a linear separator. So, if our data is not linearly separable, SVM requires a Kernel to get the data to a state where it can be separated. This can limit us, as there is not a perfect Kernel for every given dataset.
Your model is likely overfitted. A training error of 0.00 means the classifier has memorized the patterns in the training data, and those memorized patterns don't generalize to unseen data, which returns a higher test error.
When using random forest, this can occur if we use a large number of trees.
This will largely depend on the model at hand, so you could ask clarifying questions. But generally, the process is as follows:
Explanation:
Recall = TP / (TP + FN) = 10/50 = 0.2 = 20%
Specificity = TN / (TN + FP) = 15/50 = 0.3 = 30%
Precision = TP / (TP + FP) = 10/45 ≈ 0.22 = 22%
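Reproducing that arithmetic in code, with TP, FN, TN, and FP inferred from the ratios above:

```python
TP, FN, TN, FP = 10, 40, 15, 35

recall = TP / (TP + FN)          # 10/50 = 0.20
specificity = TN / (TN + FP)     # 15/50 = 0.30
precision = TP / (TP + FP)       # 10/45 ≈ 0.22

print(f"Recall={recall:.2f}, Specificity={specificity:.2f}, Precision={precision:.2f}")
```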
We use the encoder-decoder model to generate an output sequence based on an input sequence.
What makes an encoder-decoder model so powerful is that the decoder uses the final state of the encoder as its initial state. This gives the decoder access to the information that the encoder extracted from the input sequence.
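A minimal Keras sketch of that state handoff (the layer size and vocabulary size are arbitrary placeholders): the decoder LSTM is initialized with the encoder's final states.

```python
import tensorflow as tf

latent_dim, vocab_size = 64, 1000

# Encoder: read the input sequence and keep only its final hidden/cell states.
encoder_inputs = tf.keras.Input(shape=(None, vocab_size))
_, state_h, state_c = tf.keras.layers.LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder: its initial state is the encoder's final state, which is what gives it
# access to the information the encoder extracted from the input sequence.
decoder_inputs = tf.keras.Input(shape=(None, vocab_size))
decoder_seq = tf.keras.layers.LSTM(latent_dim, return_sequences=True)(
    decoder_inputs, initial_state=[state_h, state_c]
)
decoder_outputs = tf.keras.layers.Dense(vocab_size, activation="softmax")(decoder_seq)

model = tf.keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
```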
The loss metric is required. In model execution with TensorFlow, we use the EstimatorSpec object to organize training, evaluation, and prediction.
The EstimatorSpec object is initialized with a single required argument, called mode. The mode can take one of three values:
- tf.estimator.ModeKeys.TRAIN
- tf.estimator.ModeKeys.EVAL
- tf.estimator.ModeKeys.PREDICT
The keyword arguments required to initialize the EstimatorSpec will differ depending on the mode.
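A hedged model_fn sketch (the toy model, the "x" feature key, and the optimizer choice are assumptions) showing how the returned EstimatorSpec changes with the mode:

```python
import tensorflow as tf

def model_fn(features, labels, mode):
    logits = tf.compat.v1.layers.dense(features["x"], units=2)
    predictions = tf.argmax(logits, axis=1)

    if mode == tf.estimator.ModeKeys.PREDICT:
        # PREDICT only needs the predictions.
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)

    # TRAIN and EVAL both require a loss.
    loss = tf.compat.v1.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.compat.v1.train.AdamOptimizer()
        train_op = optimizer.minimize(loss, global_step=tf.compat.v1.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

    return tf.estimator.EstimatorSpec(mode=mode, loss=loss)  # EVAL
```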
Yes. Most machine learning algorithms use Euclidean distance as the metric to measure the distance between two data points. If the ranges of feature values differ greatly, the same change in different features produces very different effects on that distance, which is why rescaling matters.
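A minimal rescaling sketch with scikit-learn (the feature values are made up): after standardization, a one-unit change means the same thing in every feature.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 20000], [2, 50000], [3, 80000]], dtype=float)   # e.g., years vs. salary

X_scaled = StandardScaler().fit_transform(X)   # each feature -> mean 0, unit variance
print(X_scaled)
```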
There are three general approaches you could take:
Additional advanced questions may include:
- You must evaluate a regression model based on R², adjusted R² and tolerance. What are your criteria?
- For k-means or kNN, why do we use Euclidean distance over Manhattan distance?
- Linear regression models are usually evaluated using Adjusted R² or an F value. How would you evaluate a logistic regression model?
- Explain the difference between the normal soft margin SVM and SVM with a linear kernel.
Companies want to see that you can apply ML concepts to their real-world products and teams. You can expect questions about a company’s ML-based products and even be required to design them on your own.
Many ML interview questions like this involve applying models to an organization's specific problems. To answer this question well, you need to research the company in advance. Read about its revenue drivers and user base.
Important: Use questions like these to demonstrate your system design skills! You need to sketch out a solution with requirements, metrics, training data generation, and ranking.
Grokking the Machine Learning Interview goes over this question in detail using Netflix’s recommendation system.
The general steps for setting up a recommendation system are as follows:
This tests your knowledge of the business and industry. It also tests how well you correlate data to business outcomes and apply that understanding to a particular company's needs. You need to research the organization's business model. Be sure to ask clarifying questions before jumping in.
Some general answers could be:
The main goal of an ads selection component is to narrow down the set of ads that are relevant for a given query. In a search-based system, the ads selection component is responsible for retrieving the top relevant ads from the ads database according to the user and query context.
In a feed-based system, the ads selection component will select the top k relevant ads based more on user interests than search terms.
Here is a general solution to this question. Say we use a funnel-based approach for modeling. It would make sense to structure the ad selection process in these three phases:
Again, this question largely depends on the organization in question. You'll first want to ask clarifying questions about the system to make sure you meet all of its needs. You can speak in hypotheticals to leave room for uncertainty.
I will explain it using Twitter’s feed system to give you a sense of how to approach a problem like this. It will include:
This question gauges your investment in the industry and your vision for how to apply new technologies. GPT-3 is a new language generation model that can produce human-like text.
There are many perspectives on GPT-3, so do some reading on how it’s being used to demonstrate next-generation critical thinking.
Some general answers could be:
Additional questions could include:
- Design an ad prediction system for our company.
- What are the metrics for search ranking?
- What do you think of our current data process?
- Describe your research experience in machine learning.
- Write a query in SQL to measure the number of ads that were viewed in Moments versus News Feed.
- How do you think quantum computing will affect ML at this organization?
- Which of our current products could benefit from ML components?
Congrats! You’ve now learned the top 40 questions you will encounter in a machine learning interview. There is still a lot to learn to solidify your knowledge and get hands-on with system design, Python, and all the ML tools.
Be sure to review the additional questions I provided at the end of each section.
To move right into more practice, check out Educative’s course Grokking the Machine Learning Interview. You’ll learn how to design systems from scratch and develop a high-level ability to think about ML systems. This is the ideal place to take your ML skills to the next level and stand out from the competition.
Happy learning!