Data Science Cheatsheet PDF
Explore core data science concepts in this cheatsheet PDF that covers data preprocessing, exploratory analysis, feature engineering, and machine learning fundamentals. Understand practical techniques to prepare data, analyze patterns, build models, and avoid common mistakes for effective real-world application.
Data science involves a wide range of concepts, from statistics and data preprocessing to feature engineering and machine learning. It’s easy to forget formulas, workflows, or key principles under time pressure. This lesson will help us understand and connect these core ideas, enabling us to apply them confidently in projects, interviews, or analysis.
Data understanding and preparation
Before analyzing data, it’s essential to clean, preprocess, and structure it. High-quality, well-prepared data forms the backbone of accurate insights and effective models. Understanding the nature of the data allows us to detect patterns, prevent errors, and make informed decisions.
The following are the core techniques:
Handling missing values: Strategies include deletion (removing rows or columns), mean/median/mode imputation, or predictive imputation using ML models.
Outlier detection and treatment: Identify extreme values and decide whether to remove, cap, or transform them.
Data transformation: Scale numerical features, encode categorical variables, bin continuous data, or apply log/power transformations to prepare data.
Practical tip: Fit preprocessing steps and transformations on the training data only, then reuse the same learned parameters to transform validation or test data. This prevents data leakage.
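As a minimal sketch of this tip, assuming scikit-learn is installed and using a toy feature matrix, the imputer and scaler below learn their parameters (median, mean, standard deviation) from the training data only, then reuse them on the test data:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [np.nan], [4.0]])
X_test = np.array([[3.0], [np.nan]])

imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()

# Fit (learn the median and scaling parameters) on training data only...
X_train_prep = scaler.fit_transform(imputer.fit_transform(X_train))
# ...then reuse the same learned parameters on the test data.
X_test_prep = scaler.transform(imputer.transform(X_test))
```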
Exploratory data analysis (EDA)
EDA helps us summarize, visualize, and understand data before modeling. It uncovers trends, anomalies, and relationships that inform subsequent analysis and feature engineering.
The following are the key techniques:
Descriptive statistics: Summarize central tendency and spread using measures like mean, median, mode, variance, standard deviation, and IQR. For example, calculating the mean and standard deviation of monthly sales to understand average performance and variability.
Distribution analysis: Assess skewness and kurtosis to understand the shape of the data and detect outliers. For example, using a histogram to check if income data is right-skewed and applying a log transform if necessary.
Correlation analysis: See how variables relate to each other using Pearson correlation for linear relationships or Spearman correlation for monotonic relationships. For example, checking the correlation between study hours and exam scores to identify predictive relationships. (Spearman measures the strength and direction of a monotonic relationship between two variables using their ranks, e.g., whether higher study hours generally correspond to higher scores.)
Visualization tools: Use plots to reveal patterns, trends, and interactions in the data. For example, boxplots to detect outliers in salary data, scatter plots to visualize relationships between age and spending, and heatmaps to show correlations between multiple variables.
Practical tip: Always question whether observed relationships have a logical mechanism or could be influenced by confounding factors.
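The short sketch below, assuming pandas is installed and using made-up study-hours and exam-score data for illustration, covers the techniques above: summary statistics, skewness, and both correlation measures:

```python
import pandas as pd

# Hypothetical study-hours vs. exam-score data for illustration
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "score": [52, 55, 61, 60, 68, 70, 75, 80],
})

print(df.describe())               # mean, std, quartiles (IQR = Q3 - Q1)
print(df["score"].skew())          # > 0 indicates a right-skewed distribution
print(df.corr(method="pearson"))   # linear relationship
print(df.corr(method="spearman"))  # monotonic, rank-based relationship
```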
Probability and statistics
Probability and statistics provide the framework to quantify uncertainty, test assumptions, and validate conclusions, forming the foundation of data-driven decision-making.
The following are the key concepts:
Probability rules: NOT, AND, OR, and conditional probability guide the calculation of the likelihood of events.
Bayes’ theorem: Updates beliefs as new evidence is observed, allowing us to revise probabilities in light of new information.
Distributions: A probability distribution shows all the possible outcomes of an event and how likely each one is to happen.
Discrete distributions: Used for countable data where outcomes are distinct and separate points (e.g., the number of heads in a series of coin flips or the number of customers in a shop). Common discrete distributions include Binomial, Poisson, and Power-law.
Continuous distributions: Used for measurable data that can take any value within a range (e.g., height, weight, or time). Because there are infinitely many possible values (like 5.11, 5.112, and so on), probabilities are measured over intervals rather than at exact points. Common continuous distributions include Normal, Exponential, and Uniform.
Statistical analysis: Techniques such as sampling distributions, confidence intervals, hypothesis testing, and p-value interpretation ensure that results are evidence-based and conclusions are reliable.
Practical tip: Apply these concepts to distinguish real patterns from noise, avoid overgeneralization, and validate your insights rigorously.
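As a hedged illustration of these concepts, the sketch below applies Bayes' theorem to a made-up diagnostic-test example and evaluates one discrete and one continuous distribution with scipy.stats (assumed installed):

```python
from scipy import stats

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B), with illustrative numbers
p_disease = 0.01                    # prior P(A)
p_pos_given_disease = 0.95          # likelihood P(B|A)
p_pos_given_healthy = 0.05          # false-positive rate
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))   # total probability P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")  # ~0.161

# Discrete: probability of exactly 6 heads in 10 fair coin flips
print(stats.binom.pmf(k=6, n=10, p=0.5))

# Continuous: probability a value from the standard Normal falls in [-1, 1]
print(stats.norm.cdf(1) - stats.norm.cdf(-1))       # ~0.683
```

Note how the posterior (about 16%) is far below the test's 95% sensitivity: Bayes' theorem corrects intuition when the prior is small.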
Feature engineering
Feature engineering transforms raw data into meaningful variables that improve model performance. Well-engineered features often matter more than complex algorithms.
The following are the core methods:
Continuous features: Techniques such as scaling, normalization, binning, and log/power transformations prepare numerical data for modeling. For example, standardizing income values or applying a log transform to reduce skew in sales data.
Categorical features: Encode categorical data into numeric formats using methods like one-hot encoding, label encoding, hashing, or embeddings. For example, one-hot encoding color with values red, blue, and green into separate binary columns.
Interactions and combinations: Create new features by combining existing ones, such as polynomial features or cross-features, to capture complex patterns. For example, multiplying age and years of experience to generate an interaction term for employee productivity prediction.
Dimensionality reduction and selection: Reduce feature space or select the most informative features using techniques such as PCA (Principal Component Analysis), recursive feature elimination, or correlation filters. For example, using PCA to reduce hundreds of correlated sensor readings into a few principal components while retaining most variance.
Practical tip: Always validate new features on a separate dataset or through cross-validation to ensure they genuinely improve model performance and do not introduce overfitting.
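The sketch below, assuming a recent scikit-learn (1.2+ for the sparse_output argument), shows two of the methods above on synthetic data: one-hot encoding the red/blue/green color example, and PCA reducing five correlated features to two components:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA

colors = np.array([["red"], ["blue"], ["green"], ["blue"]])
encoder = OneHotEncoder(sparse_output=False)
print(encoder.fit_transform(colors))       # one binary column per category

# Reduce 5 correlated features to 2 principal components
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(100, 1)) for _ in range(5)])
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)       # share of variance each PC retains
```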
Machine learning foundations
Machine learning enables systems to learn patterns from data and make predictions. Understanding its fundamentals is critical for interpreting results and building robust models.
The following are the core methods:
Learning paradigms: Defines how models learn from data. Supervised learning uses labeled data to predict outcomes (e.g., predicting house prices from historical sales data). Unsupervised learning finds patterns in unlabeled data (e.g., customer segmentation based on purchasing behavior). Reinforcement learning learns by receiving rewards or penalties for actions (e.g., training a robot to navigate a maze).
Algorithms: Methods used to model data patterns. Regression models relationships between variables; linear regression predicts continuous outcomes (e.g., sales revenue), and logistic regression predicts binary outcomes (e.g., fraud or not fraud). Tree-based methods use hierarchical decision rules (e.g., decision trees, random forests, XGBoost). SVMs and ensemble methods improve performance by finding optimal boundaries or by combining multiple models (e.g., random forests improve accuracy over a single decision tree).
Model evaluation: Measures how well a model performs using metrics such as accuracy, precision, recall, F1-score, and ROC-AUC (e.g., evaluating a spam classifier using F1-score to balance false positives and false negatives).
Overfitting and generalization: Ensures models perform well on unseen data. Techniques include regularization, cross-validation, and monitoring the bias-variance tradeoff (e.g., using L2 regularization in linear regression to prevent overfitting).
Practical tip: Start simple, validate thoroughly, and iteratively improve models.
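A minimal end-to-end sketch of these foundations, assuming scikit-learn and using its built-in breast-cancer dataset: supervised learning with logistic regression (L2-regularized by default), a held-out test split, and two of the evaluation metrics above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# L2 regularization is scikit-learn's default, guarding against overfitting
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Evaluate only on data the model has never seen
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("F1-score:", f1_score(y_test, y_pred))
```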
This free PDF provides a quick guide to core data science concepts, covering statistics, data preprocessing, feature engineering, and machine learning fundamentals. It’s designed to help you quickly refresh key ideas and reference essential formulas and techniques.
Common mistakes data scientists make
Even experienced data scientists can fall into predictable traps. Awareness helps avoid these pitfalls:
Ignoring or incorrectly imputing missing data can bias results. In some cases, such as fraud detection, missing values are informative and should be flagged rather than filled.
Failing to check distributions or detect outliers may cause models to underperform or produce misleading insights.
Confusing correlation with causation can lead to incorrect conclusions about relationships in the data.
Overfitting by using overly complex models can reduce generalization to new, unseen data.
Allowing data leakage between training and test sets can artificially inflate model accuracy.
Skipping feature scaling or proper encoding of variables can diminish the performance of machine learning algorithms.
Tip: Always validate assumptions, double-check transformations, and simulate real-world scenarios whenever possible.
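For instance, one way to guard against the leakage pitfall above (a sketch assuming scikit-learn) is to bundle preprocessing and model into a Pipeline, so the scaler is re-fit inside each cross-validation fold instead of seeing the whole dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Wrong: scaler.fit_transform(X) before splitting leaks test-set statistics.
# Right: the pipeline fits the scaler on each training fold only.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X, y, cv=5, scoring="f1")
print("cross-validated F1:", scores.mean())
```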
With this data science cheatsheet, we can quickly refresh core concepts and build a strong foundation for practical data science tasks.