Machine learning 101 & data science: Tips from an industry expert

Data science and machine learning are deeply interconnected disciplines that employers expect practitioners to understand together. A working data scientist needs fluency in core Python libraries like NumPy, Pandas, and Scikit-learn, an understanding of ML algorithm types (supervised, unsupervised, semi-supervised, and reinforcement learning), and practical knowledge of modern tooling spanning MLOps, RAG pipelines, and responsible AI practices.

Key takeaways

Core Python libraries form the foundation: NumPy handles numerical computation, Pandas powers data wrangling, Scikit-learn provides ML models, and Matplotlib/Seaborn enable visualization.
ML models rely on three components: Every machine learning system is built from data, features (the variables the model examines), and the algorithm chosen to learn patterns from that data.
Classical ML still dominates structured data: Gradient boosting methods like XGBoost, LightGBM, and CatBoost consistently outperform deep neural networks on tabular business problems such as fraud detection and churn prediction.
MLOps turns experiments into production systems: A complete ML lifecycle includes experiment tracking, model versioning, deployment via APIs or batch jobs, monitoring for drift, and automated retraining.
RAG extends LLM capabilities with external context: Retrieval-augmented generation pipelines use chunking, vector embeddings, and semantic retrieval to ground large language model outputs in accurate, up-to-date information.

Modern data science has become inseparable from machine learning. While data science alone can gather insights from data, machine learning is the secret to creating accurate and actionable predictions.

As a result, employers expect data scientists to understand both. Understanding where they intersect is essential to land a data science position.

Today, we’ll explore the fundamentals of data science and ML to help you start leveraging both in your projects.

Master the skills that can get you a $100K+ salary
This course is your comprehensive guide to getting your start as a data scientist. Find easy to follow, hands-on explanations in one place.

Grokking Data Science

Essential data science Python libraries#

NumPy#

NumPy (Numerical Python) is a powerful and extensively used library for storing and computing on numerical data. It provides data structures (most notably the n-dimensional array), algorithms, and other useful utilities.

For example, this library contains basic linear algebra functions, Fourier transforms, and advanced random number capabilities. It can also be used to load data into Python and export it again.

Here are some NumPy basics you should start to become familiar with:

NumPy basics

Creating NumPy arrays and array attributes
Array indexing and slicing
Reshaping and concatenation

NumPy arithmetic and statistics basics

Computations and aggregations
Comparison and boolean masks
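As a quick tour of the basics listed above, here is a minimal sketch. The array values and variable names are made up for illustration.

```python
import numpy as np

# Creating an array and inspecting its attributes
a = np.arange(12)                      # the integers 0..11
print(a.shape, a.ndim, a.dtype)

# Indexing and slicing
first_three = a[:3]                    # elements 0, 1, 2

# Reshaping and concatenation
m = a.reshape(3, 4)                    # 3 rows, 4 columns
stacked = np.concatenate([m, m], axis=0)   # 6 rows, 4 columns

# Computations and aggregations
col_means = m.mean(axis=0)             # mean of each column
total = m.sum()                        # sum of all elements

# Comparison and boolean masks
mask = m > 5                           # True where the value exceeds 5
large_values = m[mask]                 # 1-D array of the selected values
```

Each comment maps to one item in the list, so you can run the lines one at a time in a notebook and inspect the results.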

Pandas#

Pandas is a library that you can’t avoid when working with Python on a data science project. It’s a powerful tool for data wrangling, a process required to prepare your data so that it can actually be consumed for analysis and model building.

Pandas contains a large variety of functions for data import, export, indexing, and data manipulation. It also provides handy data structures like DataFrames (two-dimensional tables of rows and columns) and Series (one-dimensional labeled arrays), along with efficient methods for handling them.

For example, it allows you to reshape, merge, split, and aggregate data.

Here are some basic Pandas concepts you should start to become familiar with:

Pandas core components

The Series Object
The DataFrame Object

Pandas DataFrame Operations

Read, view, and extract information
Selection, slicing, and filtering
Grouping and sorting
Dealing with missing values and duplicates
Pivot tables and functions
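The DataFrame operations above might look like this in practice. This is a minimal sketch: the orders table and its column names are invented for the example.

```python
import numpy as np
import pandas as pd

# A small, made-up orders table (all values are illustrative)
df = pd.DataFrame({
    "customer": ["ann", "bob", "ann", "cy", "bob", "bob"],
    "region":   ["east", "west", "east", "west", "west", "west"],
    "amount":   [20.0, 35.0, np.nan, 15.0, 35.0, 50.0],
})

# Read, view, and extract information
print(df.head())
print(df.describe())

# Dealing with missing values and duplicates
clean = df.drop_duplicates().dropna(subset=["amount"])

# Selection, slicing, and filtering
west = clean[clean["region"] == "west"]

# Grouping and sorting
totals = clean.groupby("customer")["amount"].sum().sort_values(ascending=False)

# Pivot tables
pivot = clean.pivot_table(index="region", values="amount", aggfunc="mean")
```

Note that cleaning happens before the aggregation steps, so the duplicate order and the missing amount do not distort the totals.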

Scikit-learn#

Scikit-learn is an easy-to-use library for machine learning. It comes with a variety of efficient tools for machine learning and statistical modeling, such as:

Classification models (e.g., Support Vector Machines, Random Forests, Decision Trees, Logistic Regression)
Regression analysis (e.g., Linear Regression, Ridge Regression)
Clustering methods (e.g., k-means)
Data reduction methods (e.g., Principal Component Analysis, feature selection)
Model tuning and selection, with features like grid search and cross-validation

It also allows for pre-processing of data.
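To see how several of these tools fit together, here is a minimal sketch that combines pre-processing, a classification model, grid search, and cross-validation. The dataset (iris, bundled with Scikit-learn) and the candidate values for `C` are choices made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Pre-processing and the classifier live in one pipeline, so the scaler
# is re-fit inside every cross-validation fold.
pipeline = make_pipeline(StandardScaler(), SVC())

# Grid search tries each value of C with 5-fold cross-validation.
search = GridSearchCV(pipeline, param_grid={"svc__C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# Accuracy of the best model on data it has never seen
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

Keeping the scaler inside the pipeline is a deliberate choice: it prevents information from the validation folds leaking into the pre-processing step.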

Bonus: Jupyter Notebook#

Jupyter Notebook is not limited to Python; it can handle many other languages, like R, as well. Its intuitive workflows, ease of use, and zero cost make it the tool at the heart of any data science project.

Main components of machine learning models#

There are three basic components you need to train your machine learning systems: data, features, and algorithms. As a data scientist, it’s important to understand how your choices for each of these components affect your final predictive model.

Data#

Data can be collected both manually and automatically. For example, users’ personal details like age and gender, all their clicks, and purchase history are valuable data for an online store.

Do you recall reCAPTCHA, which asks you to “Select all the street signs”? That’s an example of manually labeled data, collected for free!

Data is not always images; depending on the problem at hand, it could be tables of data with many variables (features), text, sensor recordings, sound samples, and so on.

Features#

Features are often also called variables or parameters. These are essentially the factors for a machine to look at — the properties of the “object” in question, e.g., users’ age, stock price, area of the rental properties, number of words in a sentence, petal length, size of the cells.

Choosing meaningful features is very important, but it takes practice. Sometimes it’s difficult to tell which features to use, especially when working with large datasets.

Algorithms#

Machine learning is based on general purpose algorithms.

For example, one kind of algorithm is classification. Classification allows you to put data into different groups.

The same classification algorithm used to recognize handwritten numbers could also be used to classify emails into spam and not-spam without changing a line of code! How is this possible?

Although the algorithm is the same, it’s fed different input data, so it comes up with different classification logic. Most of an ML system’s behavior comes from its training data rather than the starting algorithm.
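Here is a minimal sketch of that idea. It uses a simple nearest-centroid classifier (chosen for brevity, not taken from the article) and two tiny made-up datasets. The training and prediction code is identical for both problems; only the data changes.

```python
import numpy as np

def fit_nearest_centroid(X, y):
    """Learn one centroid (mean feature vector) per class label."""
    labels = sorted(set(y))
    centroids = np.array([X[np.array(y) == label].mean(axis=0) for label in labels])
    return labels, centroids

def predict(model, X):
    """Assign each row to the class with the closest centroid."""
    labels, centroids = model
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return [labels[i] for i in distances.argmin(axis=1)]

# Dataset 1 (made up): [exclamation marks, links] per email
emails = np.array([[5, 3], [4, 4], [0, 0], [1, 0]], dtype=float)
email_labels = ["spam", "spam", "not-spam", "not-spam"]

# Dataset 2 (made up): [height_cm, weight_kg] per animal
animals = np.array([[25, 4], [23, 5], [60, 30], [65, 28]], dtype=float)
animal_labels = ["cat", "cat", "dog", "dog"]

# Same algorithm, different training data, different learned logic
spam_model = fit_nearest_centroid(emails, email_labels)
animal_model = fit_nearest_centroid(animals, animal_labels)

print(predict(spam_model, np.array([[6.0, 2.0]])))      # ['spam']
print(predict(animal_model, np.array([[24.0, 4.5]])))   # ['cat']
```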

However, this is not meant to imply that one algorithm can be used to solve all kinds of problems! The choice of the algorithm is important in determining the quality of the final machine learning model.

Below, we’ll see how different types of algorithms are better suited to some problems than others.

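Types of ML algorithms#

Supervised learning#

In supervised learning, the training data includes the correct answers (labels). In other words, the algorithm has a supervisor or a teacher who provides it with all the answers first, like whether it’s a cat in the picture or not. The machine learns from these labeled examples.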

Another typical task, of a different type, would be to predict a target numeric value like housing prices given information about the home.

To train the system, you need to provide many correct examples of known housing prices, including both their features (number of bedrooms, location, etc.) and their labels.

Categorizing emails and recognizing which pictures contain dogs are both classification tasks in supervised learning. Predicting housing prices is a different type of task, known as regression.

In regression, the output is a continuous numeric value, like a housing price. In classification, the output is a discrete class label, such as spam or not-spam.

Basically, the type of algorithm you choose (classification or regression) depends on the type of output you want.
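The difference is easy to see side by side. In this minimal sketch, the same made-up feature (home size) is paired with a continuous label for regression and a discrete label for classification. All numbers are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Made-up training data: one feature (size in square meters) per home
sizes = np.array([[50], [70], [90], [110], [130], [150]], dtype=float)

# Regression: the label is a continuous number (price, in thousands)
prices = np.array([150, 210, 270, 330, 390, 450], dtype=float)
regressor = LinearRegression().fit(sizes, prices)
predicted_price = regressor.predict([[100.0]])[0]

# Classification: the label is a discrete class (0 = small, 1 = large)
is_large = np.array([0, 0, 0, 1, 1, 1])
classifier = LogisticRegression(max_iter=1000).fit(sizes, is_large)
predicted_class = classifier.predict([[140.0]])[0]

print(round(predicted_price, 1), predicted_class)
```

Despite its name, logistic regression is used here as a classifier: it outputs a class label rather than a continuous value.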

Most used Supervised Learning Algorithms:

Linear Regression
Logistic Regression
Support Vector Machines
Decision Trees
Random Forests
K-Nearest Neighbors
Artificial Neural Networks

Unsupervised Learning#

In unsupervised learning, the training data is unlabeled. The system needs to learn without a teacher and finds relationships based on hidden patterns in the data.

Segmenting data into groups this way is an example of what is known as clustering: classification with no predefined classes, based on features that are not known in advance.

Most used Unsupervised Learning Algorithms:

Clustering: K-Means
Visualization and dimensionality reduction: Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE)
Association rule learning: Apriori
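Two of these algorithms, K-Means and PCA, can be tried in a few lines. This is a minimal sketch on synthetic data. The two groups of points are generated for the example, and no labels are passed to either algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two made-up groups of points in 5 dimensions; no labels are given
group_a = rng.normal(loc=0.0, scale=0.5, size=(50, 5))
group_b = rng.normal(loc=5.0, scale=0.5, size=(50, 5))
X = np.vstack([group_a, group_b])

# Clustering: K-Means assigns every point to one of k clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Dimensionality reduction: PCA projects 5 features down to 2 for plotting
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape, np.bincount(cluster_ids))
```

The cluster numbers themselves (0 or 1) are arbitrary; what matters is which points end up grouped together.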

Semi-supervised Learning#

Semi-supervised learning deals with partially labeled training data, usually a lot of unlabeled data with some labeled data.

Most semi-supervised learning algorithms are a combination of unsupervised and supervised algorithms.
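One simple way to combine the two is "cluster, then label": group all the data without labels, then let the few labeled points name each group. The sketch below illustrates that idea on synthetic data. It is one possible approach rather than a standard recipe, and it assumes every cluster contains at least one labeled point.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(40, 2)),
    rng.normal(loc=4.0, scale=0.5, size=(40, 2)),
])

# Only two points are labeled; -1 marks "unlabeled" (a convention for this sketch)
y = np.full(80, -1)
y[0] = 0     # one labeled example from the first group
y[40] = 1    # one labeled example from the second group

# Unsupervised step: group all points, labeled or not
cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Supervised step: each cluster takes the label of the labeled point it contains
cluster_to_label = {cluster_ids[i]: y[i] for i in np.flatnonzero(y != -1)}
y_filled = np.array([cluster_to_label[c] for c in cluster_ids])
```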

Steps of a data science project#

Now you know some of the fundamental ML concepts, but how do data scientists apply them?

Let’s walk through the steps of a data science project to understand how you’ll use ML in the workplace.

Frame the problem and look at the big picture:

The first step is to understand the problem. You have to figure out the right questions to ask and how to frame them, and know the assumptions that follow from domain knowledge.

Get the data permissions:

Do NOT forget about data privacy and compliance here; they are of paramount importance! Ask questions and engage with stakeholders if needed.

Summarize the data:

Find the types of variables or map out the underlying data structure. This involves finding correlations among variables, identifying the most important variables, checking for missing values and mistakes in the data, etc.

Visualize the data:

Create a visual of the data to see the bigger picture. You’ll likely find new trends, anomalies, and outliers. Use data summarization and data visualization techniques to understand the story the data is telling you.

Create a simplistic model:

Linear or logistic regression are good starting points. Start with only the most important features (directly observed and reported features). This will allow you to gain a good familiarity with the problem at hand and also set the right direction for the next steps.

Narrow your features:

Prepare the data to extract the more intricate data patterns. Combine and modify existing features to create new features.

Explore and select a model:

Here, you’ll explore many different machine learning models and short-list the best ones based on comparative evaluation, e.g., compare RMSE or ROC-AUC scores for different models.

Customize the model:

Fine-tune the parameters of your chosen model to fit the problem. Consider combining multiple models for the best results.

Present your solution:

Share your findings in a simple, engaging, and visually appealing manner. Remember to tailor your presentation and language based on the technical level of your target audience. Give extra focus to what your findings mean and why that matters in the interests of the company.

From here, employers may ask you to create a system to address your findings or may ask you to form a proposal of next steps.

These 9 steps reflect the standard day-to-day machine learning work you can expect in the industry.
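The "explore and select a model" step is the easiest one to show in code. Here is a minimal sketch that compares two candidate models by cross-validated ROC-AUC. The dataset (breast cancer, bundled with Scikit-learn) and the two candidates are choices made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean ROC-AUC over 5 cross-validation folds for each candidate
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)
print(scores, best_name)
```

For a regression problem, the same loop works with a regression metric such as `scoring="neg_root_mean_squared_error"` in place of ROC-AUC.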

From classical ML to modern AI: When to use what#

Machine learning isn’t just about linear regression or decision trees anymore. Today’s data scientists often choose between classical ML and large language models (LLMs) depending on the problem.

Classical ML is still ideal for structured data (like tabular business metrics, sensor readings, or credit scoring) and tasks where interpretability, training speed, and small datasets matter.
Deep learning shines with unstructured data — images, text, audio — and larger datasets.
LLMs and transformers enable entirely new applications: natural language interfaces, summarization, semantic search, and code generation.

The key is knowing when a simple model will outperform a complex one — and when the scale and nature of the data demand a transformer.

RAG in practice: Bringing context to large models#

Retrieval-augmented generation (RAG) is one of the most important shifts in machine learning since the rise of deep learning. It’s a technique that lets LLMs fetch relevant information from a knowledge base before generating an answer — resulting in more accurate, grounded, and up-to-date responses.

Key components of a RAG pipeline:

Chunking: Breaking large documents into manageable, meaningful pieces.
Vector embeddings: Converting text into numerical representations for semantic search.
Retrieval: Searching the vector database for the most relevant chunks.
Generation: Feeding that context into an LLM to craft a response.
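The four components above can be sketched end to end in plain Python. This is a toy illustration, not a production design: the word-count `embed` function stands in for a real embedding model, an in-memory list stands in for a vector database, and the final LLM call is left out. The document text and all function names are invented for the example.

```python
import re
import numpy as np

def chunk(text):
    """Chunking: split text into sentence-sized pieces."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(text):
    """Lowercase the text and keep only alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def embed(text, vocabulary):
    """Toy embedding: word counts over a fixed vocabulary (stand-in for a real model)."""
    tokens = tokenize(text)
    return np.array([tokens.count(word) for word in vocabulary], dtype=float)

def retrieve(query, chunks, vocabulary, top_k=1):
    """Retrieval: rank chunks by cosine similarity to the query."""
    q = embed(query, vocabulary)
    scored = []
    for c in chunks:
        v = embed(c, vocabulary)
        denom = np.linalg.norm(q) * np.linalg.norm(v)
        scored.append((float(q @ v / denom) if denom else 0.0, c))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]

document = (
    "Refunds are processed within five business days. "
    "Shipping is free for orders above fifty dollars. "
    "Support is available by email on weekdays."
)
chunks = chunk(document)
vocabulary = sorted({w for c in chunks for w in tokenize(c)})

question = "How long do refunds take?"
context = retrieve(question, chunks, vocabulary, top_k=1)

# Generation: this prompt would be sent to an LLM (the call is omitted here)
prompt = f"Answer using only this context:\n{context[0]}\n\nQuestion: {question}"
print(prompt)
```

A real pipeline would swap in learned embeddings, which match on meaning rather than exact words, but the flow of chunk, embed, retrieve, and generate stays the same.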

For data scientists, understanding RAG means unlocking a new class of ML-powered applications — from chatbots over internal documentation to context-aware analytics tools.

MLOps essentials for data scientists#

Training a model is only part of the job. Deploying, monitoring, and maintaining it are equally crucial — and that’s where MLOps comes in.

Modern ML workflows follow a clear lifecycle:

Experiment tracking: Tools like MLflow or Weights & Biases help log hyperparameters, metrics, and results.
Model registry: Store and version models for reproducibility.
Deployment: Serve models via REST APIs, streaming endpoints, or batch jobs.
Monitoring: Track performance drift, data quality, and inference latency.
Retraining: Automate model refreshes when data shifts.
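The monitoring and retraining steps can be illustrated with a simple drift check. The sketch below uses the Population Stability Index (PSI), one common drift measure among several, on synthetic data. The 0.2 threshold is an assumption for this example and should be tuned per use case.

```python
import numpy as np

def population_stability_index(reference, current, bins=10):
    """PSI between two samples of one feature; larger means more drift."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    # Clip so values outside the reference range fall into the end bins
    current = np.clip(current, edges[0], edges[-1])
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    # A small floor avoids division by zero and log(0) for empty bins
    ref_frac = np.clip(ref_frac, 1e-6, None)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
training_feature = rng.normal(0.0, 1.0, size=5000)      # data the model was trained on
same_distribution = rng.normal(0.0, 1.0, size=5000)     # production data, no drift
shifted_distribution = rng.normal(1.0, 1.0, size=5000)  # production data, mean has moved

DRIFT_THRESHOLD = 0.2   # assumed cut-off for this sketch

psi_stable = population_stability_index(training_feature, same_distribution)
psi_shifted = population_stability_index(training_feature, shifted_distribution)
needs_retraining = psi_shifted > DRIFT_THRESHOLD
print(round(psi_stable, 4), round(psi_shifted, 4), needs_retraining)
```

In a real system, a check like this would run on a schedule, and a value above the threshold would trigger an alert or a retraining job.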

Mastering MLOps is what turns one-off notebooks into production-grade machine learning systems.

Modern tabular modeling: Why tree-based methods still win#

Deep learning gets the headlines, but for structured, tabular data, gradient boosting algorithms like XGBoost, LightGBM, and CatBoost still dominate.

They’re fast, require less data, and often outperform deep neural networks on classic business problems such as fraud detection, churn prediction, and recommendation ranking.

Best practice: always benchmark a tree-based model as a baseline — you might be surprised how hard it is to beat.
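Here is a minimal sketch of such a baseline. To stay within Scikit-learn, it uses `GradientBoostingClassifier` as a stand-in for XGBoost, LightGBM, or CatBoost, and synthetic data as a stand-in for a real business dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic tabular data standing in for a business dataset such as churn
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Trivial reference point: always predict the most frequent class
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Gradient-boosted trees with default settings as the baseline to beat
boosted = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

dummy_accuracy = dummy.score(X_test, y_test)
boosted_accuracy = boosted.score(X_test, y_test)
print(round(dummy_accuracy, 3), round(boosted_accuracy, 3))
```

Any more complex model you try afterwards should be judged against the boosted score, not against the trivial one.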

The modern data stack: Beyond pandas#

While pandas remains a core tool for data manipulation, new libraries make large-scale data wrangling faster and more efficient:

Polars: A lightning-fast DataFrame library with parallel execution and lazy evaluation.
DuckDB: An in-process analytical database that lets you run SQL queries directly on local files.
Arrow: A columnar memory format that powers high-performance data interchange between tools.

These tools integrate seamlessly with Python ML workflows and are becoming staples in data science pipelines.

Responsible AI and trustworthy ML#

As machine learning becomes embedded in more decisions, accountability matters. Responsible AI practices help ensure models are ethical, explainable, and compliant.

Start with these steps:

Model documentation: Use tools like Model Cards to communicate intended use cases, limitations, and performance.
Data transparency: Publish Datasheets for Datasets describing how your training data was collected and processed.
Risk assessment: Follow frameworks like the NIST AI RMF to evaluate bias, fairness, and potential harm.

Building trust isn’t just a legal requirement — it’s also good data science.

Deployment and evaluation in the LLM era#

Deploying modern ML systems — especially LLMs — involves new considerations:

Latency budgets: Optimize retrieval and generation for performance.
Prompt versioning: Track changes to prompt templates for reproducibility.
Evaluation: Measure hallucination rates, grounding accuracy, and user satisfaction.
Feedback loops: Capture user feedback to fine-tune models over time.
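Two of these considerations, prompt versioning and latency tracking, need very little code to get started. The sketch below is one possible approach using only the standard library. `fake_generate` is a hypothetical placeholder for a real LLM call, and the templates and log fields are invented for the example.

```python
import hashlib
import time

def prompt_version(template: str) -> str:
    """Derive a short, stable version id from the template text."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]

def timed_call(fn, *args):
    """Run fn and return (result, elapsed seconds) for latency tracking."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def fake_generate(prompt: str) -> str:
    """Placeholder for a real LLM call (hypothetical; returns a canned string)."""
    return "stub answer"

template_v1 = "Answer using only this context:\n{context}\n\nQuestion: {question}"
template_v2 = "Use the context to answer briefly:\n{context}\n\nQuestion: {question}"

prompt = template_v1.format(context="Refunds take five days.", question="How long?")
answer, latency_s = timed_call(fake_generate, prompt)

# One log record per request ties the output to the exact template that produced it
log_record = {
    "prompt_version": prompt_version(template_v1),
    "latency_s": latency_s,
    "answer": answer,
}
print(log_record)
```

Because the version id is derived from the template text, any edit to a template produces a new id automatically, which makes regressions easier to trace.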

Treat deployment as an iterative process, not a final step — especially when working with generative models.

Where to go from here#

We’ve covered some of the basics of machine learning for data scientists, but there is still a lot more to learn and explore if you really want to get your career started. You don’t have to go through it alone.

Industry expert and Microsoft Senior AI Engineer, Samia Khalid, has compiled her learnings into a comprehensive course, Grokking Data Science. This course lays out everything you’ll need in one place to get started and thrive in a data science career.

You’ll learn:

Python fundamentals for data science
The fundamentals of statistics
Machine learning 101
End-to-end machine learning project

By the end of the course, you’ll have extensive hands-on practice with Python data science tools and real-life experience with a data science project.

Happy learning!
