
Pillars of Data Science

Understand the key pillars of data science including statistics, probability, mathematics, machine learning, and deep learning. Learn how these foundations support data processing, feature extraction, predictive modeling, and real-world applications. This lesson helps you grasp essential concepts that drive data-driven decision-making and develop practical skills in data science methodologies.

Data science relies on several fundamental principles that serve as its foundation or building blocks. These building blocks help us extract insights and knowledge from data and determine what kind of model a problem requires. They include statistics, probability, and linear algebra, along with ML and DL models.

Statistics and probability

Statistics and probability provide the basis for many data processing, feature transformation, visualization, analysis, and evaluation techniques. Statistics helps us collect, organize, and analyze data, as in descriptive or quantitative analysis (summarizing data). We can measure how spread out data is with variance, covariance, and standard deviation; how data is centered with the mean, median, and mode; and its relative skewness with quartiles. Probability helps us find patterns and trends that lie within data and draw conclusions from them, as in inferential analysis (making inferences from data). We can estimate a data distribution within a confidence interval, perform hypothesis or A/B testing, and make predictions with the central limit theorem (CLT). With Bayes’ theorem, we can quantify model uncertainty, incorporate prior probability, estimate model parameters, and make inferences. Statistics and probability also help us establish the statistical significance of results with parametric tests, such as p-value testing, t-tests, analysis of variance (ANOVA), and so on.
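As a quick illustration of these ideas, here is a minimal sketch (using NumPy and SciPy on made-up samples) that computes the descriptive measures mentioned above and runs a two-sample t-test, the kind of parametric test used in A/B testing. The numbers are purely illustrative.

```python
# A minimal sketch of the descriptive and inferential ideas above,
# using NumPy and SciPy on small made-up samples (illustrative data only).
import numpy as np
from scipy import stats

# Two hypothetical samples, e.g., conversion times from an A/B test
group_a = np.array([12.1, 11.8, 13.0, 12.5, 11.9, 12.7])
group_b = np.array([13.2, 13.5, 12.9, 13.8, 13.1, 13.6])

# Descriptive statistics: center and spread
print("mean:", group_a.mean(), "median:", np.median(group_a))
print("variance:", group_a.var(ddof=1), "std dev:", group_a.std(ddof=1))
print("quartiles:", np.percentile(group_a, [25, 50, 75]))

# Inferential statistics: two-sample t-test for a difference in means
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # small p suggests a real difference
```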

Data science usage of statistics and probability in the real world

In this lesson, we’ll discuss some other examples of statistical and probability models. One of them is the Markov chain model, a stochastic model that describes a sequence of possible events in which the probability of each event depends only on the state reached in the previous event. It estimates how likely a particular series of events is to occur. Some application areas of Markov chains are predicting network traffic and/or security breaches, analyzing queueing systems, analyzing DNA and protein sequences, text generation and speech recognition, and recommendation systems.
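The following is a minimal sketch of a Markov chain, assuming a hypothetical two-state weather model with an invented transition matrix. It simulates a short sequence of states and computes the probability of one specific sequence.

```python
# A minimal sketch of a two-state Markov chain (hypothetical weather model):
# the next state depends only on the current state, via a transition matrix.
import numpy as np

states = ["sunny", "rainy"]
# transition[i][j] = P(next state = j | current state = i)
transition = np.array([[0.8, 0.2],
                       [0.4, 0.6]])

rng = np.random.default_rng(seed=0)
current = 0  # start in "sunny"
sequence = [states[current]]
for _ in range(10):
    current = rng.choice(2, p=transition[current])
    sequence.append(states[current])

print(" -> ".join(sequence))

# Probability of a specific path, e.g., sunny -> rainy -> rainy
p = transition[0, 1] * transition[1, 1]
print("P(sunny -> rainy -> rainy) =", p)
```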

Another well-known example is the autoregressive model, a statistical model that predicts a future value of a series using a linear combination of its past values. Some applications of autoregressive models are predicting stock prices, predicting pandemics, analyzing GDP and inflation, demand forecasting, climate modeling, and resource allocation. We also have a huge variety of techniques to assess a model’s performance, such as cross-validation, resampling methods, statistical significance testing, and Monte Carlo simulations.
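Below is a minimal sketch of an autoregressive model: an AR(2) fit obtained with ordinary least squares in NumPy on a synthetic series. The coefficients (0.6 and 0.3) and noise level are arbitrary choices for illustration, not values from any real dataset.

```python
# A minimal sketch of an autoregressive model: predict the next value
# as a linear combination of the previous two values (an AR(2) fit),
# estimated with ordinary least squares on synthetic data.
import numpy as np

rng = np.random.default_rng(seed=1)
n = 200
series = np.zeros(n)
for t in range(2, n):  # synthetic series from a known AR(2) process plus noise
    series[t] = 0.6 * series[t - 1] + 0.3 * series[t - 2] + rng.normal(scale=0.5)

# Build the lagged design matrix: each row is [y_{t-1}, y_{t-2}]
X = np.column_stack([series[1:-1], series[:-2]])
y = series[2:]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated AR coefficients:", coef)  # should be close to [0.6, 0.3]

# One-step-ahead forecast from the last two observations
forecast = coef @ np.array([series[-1], series[-2]])
print("next-value forecast:", forecast)
```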

These are only a few examples; the world of data science is filled with many more statistical and probability techniques. That’s why statistics and probability are considered one of the pillars of data science.

Mathematics and linear algebra

Mathematics and linear algebra play a fundamental role in data science, providing the theoretical and computational techniques for data modeling and analysis. They are the forces behind NLP and digital image processing (DIP). They provide a basis for ML algorithms, dimension reduction, factorization, and optimization techniques. We can use linear algebra to represent data in matrices, where each instance can be considered as a vector. It allows for the swift application of mathematical operations on the data. We can use it for principal component analysis (PCA). We can find latent patterns and structures in data with singular value decomposition (SVD) and nonnegative matrix factorization. We can process word embeddings and perform topic modeling by representing text data in vector spaces.
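To make the connection between SVD and PCA concrete, here is a minimal sketch that centers a small synthetic data matrix, factorizes it with NumPy’s SVD, and projects the data onto its top two principal components. The matrix shape and random data are arbitrary choices for illustration.

```python
# A minimal sketch of PCA via the singular value decomposition (SVD),
# using only NumPy on a small synthetic data matrix (rows = instances, columns = features).
import numpy as np

rng = np.random.default_rng(seed=2)
X = rng.normal(size=(100, 5))          # 100 instances, 5 features
X_centered = X - X.mean(axis=0)        # PCA requires mean-centered data

# SVD factors the data matrix; the right singular vectors are the principal axes
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

explained_variance = S**2 / (X.shape[0] - 1)
print("variance captured by each component:", explained_variance)

# Project onto the top 2 principal components (dimension reduction 5 -> 2)
X_reduced = X_centered @ Vt[:2].T
print("reduced shape:", X_reduced.shape)   # (100, 2)
```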

For example, in the context of ML, data is often represented as structured numerical arrays. Linear algebra helps in the manipulation and analysis of these arrays, simplifying the process of model creation and assessment. In NLP, linear algebra is used to convert words into numerical vectors. Because a machine can’t work with raw words directly, these numeric representations are what allow us to process language computationally.
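As a small illustration of turning words into vectors, the sketch below builds a bag-of-words count matrix with scikit-learn’s CountVectorizer on a tiny made-up corpus; it assumes a reasonably recent scikit-learn version.

```python
# A minimal sketch of turning words into numeric vectors with a bag-of-words
# count matrix. The tiny corpus below is made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "data science uses statistics",
    "machine learning learns from data",
    "statistics and probability support data science",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)       # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # each row is a document as a vector
```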

Linear algebra is the backbone of NLP and DIP. In NLP, it helps represent text as numerical arrays and word embeddings so that models can understand and learn the patterns of natural languages. In DIP, it allows us to transform images into numeric matrices where each element corresponds to a pixel’s color or intensity. With the help of linear, Fourier, and discrete cosine transformations, we can convert images from the spatial domain to the frequency domain. We can detect object boundaries, extract features, and perform image segmentation with linear algebra techniques.
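Here is a minimal sketch of image processing with linear algebra: a synthetic grayscale image represented as a NumPy matrix and convolved with a Sobel-style kernel to highlight object boundaries. The image and kernel values are illustrative, not taken from a real pipeline.

```python
# A minimal sketch of treating an image as a matrix and detecting edges
# with a small convolution (a Sobel-style filter), using NumPy and SciPy.
# The "image" here is a synthetic grayscale array, not a real photo.
import numpy as np
from scipy.signal import convolve2d

# Synthetic 8x8 grayscale image: dark left half, bright right half
image = np.zeros((8, 8))
image[:, 4:] = 1.0

# Sobel kernel that responds to horizontal intensity changes
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

edges = convolve2d(image, sobel_x, mode="same", boundary="symm")
print(np.abs(edges))  # large values mark the vertical boundary between halves
```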

Natural language understanding and generation

When data must not just be simplified but also have its crucial information preserved, dimension reduction techniques, such as PCA, are applied. These techniques are largely built upon the principles of linear algebra. Furthermore, loss functions, which are used to optimize ML algorithms, are primarily built upon concepts from linear algebra.
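To show how a loss function reduces to linear algebra, the sketch below writes mean squared error for a linear model as a dot product and minimizes it with plain gradient descent on synthetic data. The learning rate, iteration count, and true weights are arbitrary choices for illustration.

```python
# A minimal sketch of a loss function expressed with linear algebra:
# mean squared error (MSE) for a linear model, plus its gradient,
# which is what optimizers use to fit ML models. Data here is synthetic.
import numpy as np

rng = np.random.default_rng(seed=3)
X = rng.normal(size=(50, 3))                 # 50 instances, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=50)

w = np.zeros(3)                              # initial model parameters
for _ in range(200):                         # plain gradient descent
    residual = X @ w - y
    loss = (residual @ residual) / len(y)    # MSE, written as a dot product
    grad = 2 * X.T @ residual / len(y)       # gradient of the loss w.r.t. w
    w -= 0.1 * grad

print("learned weights:", w)                 # should approach [2.0, -1.0, 0.5]
```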

Major platforms that use recommendation algorithms, such as Amazon and Netflix, leverage linear algebra to provide personalized recommendations to users. Overall, we can safely say that data science is incomplete without linear algebra. While a deep understanding of linear algebra isn’t always essential for applying data science techniques, greater familiarity with its concepts makes the field much easier to navigate.
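As a rough sketch of the linear algebra behind recommendations, the example below factorizes a small, invented user-item rating matrix with a truncated SVD and uses the low-rank reconstruction to score unrated items. Real recommender systems are far more involved; this only illustrates the matrix-factorization idea.

```python
# A minimal sketch of how linear algebra underlies recommendations:
# factorize a small, made-up user-item rating matrix with a truncated SVD
# and use the low-rank reconstruction to score unseen items.
import numpy as np

# Rows = users, columns = items; 0 means "not yet rated" (hypothetical data)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

U, S, Vt = np.linalg.svd(ratings, full_matrices=False)
k = 2                                         # keep the 2 strongest factors
approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]   # low-rank reconstruction

print(np.round(approx, 2))  # predicted scores for the unrated (0) entries
```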

Machine learning and deep learning

ML and DL concepts are perhaps the most attractive parts of data science. ML and DL are subfields of AI that focus on developing algorithms and models that learn from data and make predictions. They are widely used in data science to improve data analysis because they can learn patterns, trends, and relationships that might otherwise go unnoticed.

ML can be divided broadly into three categories: supervised, unsupervised, and reinforcement learning. These form the cornerstone of ML, each addressing distinct challenges and applications in data analysis and decision-making. Let’s discuss them briefly.

  • Supervised learning: In supervised learning, ML models are trained to map input data to desired outputs based on labeled sample instances. We can think of supervised learning as how we use flashcards for learning. The common supervised learning techniques include classification and regression. Classification is used where the output is discrete classes, for example, email classification (spam/not spam) and sentiment analysis (positive/negative/neutral). Regression is used where the output is continuous, for example, house price estimation and stock price estimation. A minimal classification sketch follows this list.

  • Unsupervised learning: This is a type of ML that explores datasets for hidden patterns without requiring human guidance. It’s particularly useful in scenarios where labeled data is scarce or unavailable. Some common unsupervised learning techniques include clustering, dimension reduction, and principal component analysis. Clustering is used for market segmentation and social network analysis, and dimension reduction is used in speech recognition and bioinformatics. PCA helps to visualize multidimensional data.

  • Reinforcement learning: This is a learning paradigm where a model learns to make sequences of decisions by interacting with an environment to maximize a cumulative reward. For example, consider how we train a pet dog to do tricks. We use rewards (treats) for desirable behaviors so that the dog can learn that certain behaviors lead to rewards, while others don’t. In order to maximize treats (the long-term goal), the dog optimizes its behavior by repeating the rewarding actions. The same is true in reinforcement learning. Whether with dogs or models, it’s about learning through rewards and actions to achieve desired outcomes. Reinforcement learning is used in robotics, NLP, and image processing.
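Here is a minimal sketch of the supervised learning workflow referenced in the first bullet: fit a classifier to labeled examples and evaluate it on held-out data. It uses scikit-learn with a synthetic dataset standing in for real features and labels.

```python
# A minimal sketch of supervised classification: train a model on labeled
# examples, then predict labels for new data. Uses scikit-learn with a
# small synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic labeled data standing in for, e.g., spam / not-spam features
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                # learn the input -> label mapping

predictions = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))
```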

ML and DL

DL is a subset of ML that focuses on training artificial neural networks to perform various tasks. The neural networks consist of many hidden layers that enable them to learn and represent complex patterns in data. DL models create hierarchical representations of data where lower layers capture simple features, and higher layers learn more abstract and complex features. A fundamental difference between traditional ML and DL models is feature learning. DL models have the ability to extract the features from the data, whereas traditional ML models require handcrafted features. This makes DL models more complex and robust, promoting end-to-end training and scalability.
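The sketch below shows a small deep learning model in PyTorch: a feed-forward network with hidden layers trained end to end on synthetic data. The layer sizes, optimizer, and training length are arbitrary choices for illustration.

```python
# A minimal sketch of a deep learning model: a small feed-forward neural
# network with hidden layers, trained end to end on synthetic data.
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(256, 4)                       # 256 instances, 4 raw features
y = (X.sum(dim=1, keepdim=True) > 0).float()  # synthetic binary target

# Stacked layers: earlier layers learn simple features, later ones combine them
model = nn.Sequential(
    nn.Linear(4, 16), nn.ReLU(),
    nn.Linear(16, 8), nn.ReLU(),
    nn.Linear(8, 1), nn.Sigmoid(),
)

loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(200):                      # end-to-end training loop
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

accuracy = ((model(X) > 0.5).float() == y).float().mean()
print(f"final loss {loss.item():.3f}, training accuracy {accuracy.item():.2f}")
```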

Data scientists utilize ML and DL models and train them with substantial amounts of data. These models can then be applied to a variety of tasks, such as prediction, generation, and analysis. Many companies engaged in data science leverage these models to improve their daily operations.

These models undergo a continuous cycle of training and updates as new data arrives. Moreover, the world of AI is shifting from supervised learning to few-shot and zero-shot learning, and from training on domain-specific data to transfer learning. This constant evolution means that the limits of these models keep moving, making data science a highly dynamic field.

Test yourself!

Let’s test your knowledge of the concepts covered in this lesson.

Technical Quiz

1. In reinforcement learning, how does the model optimize its behavior?

A. By clustering data
B. By maximizing cumulative rewards
C. By performing dimension reduction
D. By minimizing the cumulative rewards