Rules of Probability
Explore the foundational rules of probability crucial for generative AI. Understand how probability values are assigned, the difference between dependent and independent data, and how conditional and joint probabilities inform model design. Discover how Bayes' theorem links these concepts and the distinction between discriminative and generative models, with examples including neural networks like VAEs and GANs.
At the simplest level, a model, be it for machine learning or a more classical method such as linear regression, is a mathematical description of how various kinds of data relate to one another.
In the task of modeling, we usually think about separating the variables of our dataset into two broad classes:
Independent data: These are primarily the inputs to a model, denoted by $X$. They could be categorical features (such as a "0" or "1" in six columns indicating which of six schools a student attends), continuous (such as the heights or test scores of the same students), or ordinal (the rank of a student in the class).
Dependent data: These are the outputs of our models, denoted by $Y$. As with the independent variables, they can be continuous, categorical, or ordinal, and they can be an individual element or a multidimensional matrix (tensor) for each element of the dataset.
So, how can we describe the data in our model using statistics? In other words, how can we quantitatively describe what values we are likely to see, how frequently, and which values are more likely to appear together? One way is by asking the likelihood of observing a particular value in the data, or the probability of that value. For example, if we were to ask what the probability is of observing a roll of 4 on a six-sided die, the answer is that, on average, we would observe a 4 once every six rolls. We write this as follows:

$$P(X=4) = \frac{1}{6} \approx 16.67\%$$

where $P$ denotes "probability of."
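To make this concrete, here is a minimal sketch in plain Python that estimates this probability by simulation; the exact estimate will vary slightly from run to run.

```python
import random

# Estimate P(X = 4) for a fair six-sided die by simulation.
# With enough rolls, the empirical frequency approaches 1/6 (about 16.67%).
rolls = [random.randint(1, 6) for _ in range(100_000)]
p_four = sum(1 for r in rolls if r == 4) / len(rolls)
print(f"Estimated P(X=4): {p_four:.4f}  (theoretical: {1/6:.4f})")
```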
What defines the allowed probability values for a particular dataset? If we imagine the set of all possible values of a dataset, such as all values of a die, then a probability maps each value to a number between 0 and 1. The values of the dataset may belong to discrete classes (such as the faces of a die) or to an infinite set of potential values (such as variations in height or weight). In either case, however, the probabilities have to follow certain rules, the probability axioms formulated by the mathematician Andrey Kolmogorov:
1. The probability of an observation (a die roll, a particular height, and so on) is a non-negative, finite number between 0 and 1.
2. The probability of at least one of the observations in the space of all possible observations occurring is 1.
3. The probability that any one of a set of distinct, mutually exclusive events occurs is the sum of the probabilities of the individual events.
While these rules might seem abstract, we'll later see that they have direct relevance to the development of neural network models. For example, the softmax function applies rule 1 by generating a probability between 0 and 1 for each target class, while rules 2 and 3 guarantee that these mutually exclusive class probabilities sum to 1 (a real-world image cannot be both a dog and a cat, so the probabilities of the two outcomes add).
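A minimal NumPy sketch of softmax shows all three rules in action (the scores below are arbitrary illustrative values):

```python
import numpy as np

def softmax(logits):
    """Map arbitrary real-valued scores to probabilities in (0, 1) that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.1])     # raw network outputs for three classes
probs = softmax(scores)
print(probs)        # each value lies between 0 and 1 (rule 1)
print(probs.sum())  # the mutually exclusive outcomes sum to 1 (rules 2 and 3)
```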
However, in the context of machine learning and modeling, we are not usually interested in just the probability of observing a piece of input data, $X$; we instead want to know the conditional probability of an outcome $Y$ given that data. In other words, we want to know how likely a label is for a set of data, based on that data. We write this as the probability of $Y$ given $X$, or the probability of $Y$ conditional on $X$: $P(Y|X)$.
Another question we could ask about $X$ and $Y$ is how likely they are to occur together, their joint probability, which can be expressed using the conditional probability above:

$$P(X, Y) = P(Y|X)\,P(X) = P(X|Y)\,P(Y)$$

This formula expresses the probability of $X$ and $Y$ occurring together. If $X$ and $Y$ are completely independent of one another, the joint probability is simply the product of their individual probabilities: $P(X, Y) = P(X)\,P(Y)$.
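As a quick illustration, the sketch below builds a joint distribution for two binary variables (the numbers are made up for illustration) and numerically verifies both the chain rule and the independence test:

```python
import numpy as np

# Joint distribution P(X, Y) for two binary variables, as a 2x2 table:
# rows index x in {0, 1}, columns index y in {0, 1}; entries sum to 1.
joint = np.array([[0.3, 0.2],
                  [0.1, 0.4]])

p_x = joint.sum(axis=1)             # marginal P(X)
p_y = joint.sum(axis=0)             # marginal P(Y)
p_y_given_x = joint / p_x[:, None]  # conditional P(Y|X), row-normalized

# The chain rule P(X, Y) = P(Y|X) P(X) recovers the joint table exactly.
print(np.allclose(p_y_given_x * p_x[:, None], joint))  # True

# Independence would require P(X, Y) = P(X) P(Y); for this table it fails.
print(np.allclose(np.outer(p_x, p_y), joint))          # False
```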
These expressions become important in our later discussion of complementary priors and of the ability of restricted Boltzmann machines to simulate independent data samples. They are also important as building blocks of Bayes' theorem, which we'll discuss next.
Discriminative and generative modeling, and Bayes' theorem
Now let's consider how these rules of conditional and joint probability relate to the kinds of predictive models that we build for various machine learning applications. In most cases, such as predicting whether an email is fraudulent or the dollar amount of the future lifetime value of a customer, we are interested in the conditional probability $P(Y|X=x)$, where $Y$ is the outcome and $x$ is an observed value of the input data. Models that compute this probability directly are known as discriminative models.
Another way to understand discriminative modeling is in the context of Bayes' theorem, which relates the conditional probability of an event to the probability of its reverse:

$$P(Y|X) = \frac{P(X|Y)\,P(Y)}{P(X)} = \frac{P(X, Y)}{P(X)}$$
In Bayes' formula, $P(X|Y)$ is known as the likelihood, the support that the observation $X$ lends to the outcome $Y$; $P(Y)$ is the prior, the plausibility of the outcome before seeing any evidence; $P(X)$ is the evidence (or marginal likelihood), the overall probability of the observation; and $P(Y|X)$ is the posterior, the probability of the outcome given all the evidence.
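A small worked example may help; the numbers below are invented purely for illustration (a toy spam filter), but the arithmetic is exactly Bayes' formula:

```python
# Worked Bayes' theorem example with made-up numbers: is an email spam (Y=1)
# given that it contains the word "offer" (X=1)?
p_spam = 0.2             # prior P(Y=1)
p_word_given_spam = 0.6  # likelihood P(X=1 | Y=1)
p_word_given_ham = 0.05  # P(X=1 | Y=0)

# Evidence P(X=1), computed by the law of total probability.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior P(Y=1 | X=1) = P(X=1 | Y=1) * P(Y=1) / P(X=1)
posterior = p_word_given_spam * p_spam / p_word
print(f"P(spam | 'offer') = {posterior:.3f}")  # 0.750
```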
In the context of discriminative learning, we can see that a discriminative model directly computes the posterior; we could have a model of the likelihood or prior, but it is not required in this approach. Even though you may not have realized it, most of the models you have probably used in the machine learning toolkit are discriminative, such as the following:
Linear regression
Logistic regression
Random forests
Gradient-boosted decision trees (GBDT)
Support vector machines (SVM)
The first two (linear and logistic regression) model the outcome $Y$ conditional on the data $X$ using an explicit probability distribution (Gaussian for linear regression, Bernoulli via the sigmoid for logistic regression). The last three have no formal probability model at all; they learn a function (an ensemble of trees or a separating boundary) that maps $X$ to $Y$.
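As a concrete example of a discriminative model, the following sketch fits a logistic regression on synthetic data and reads off the posterior $P(Y|X)$ directly; it assumes scikit-learn is available:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# A discriminative model estimates the posterior P(Y|X) directly.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),   # class 0 samples
               rng.normal(2, 1, (100, 2))])  # class 1 samples
y = np.array([0] * 100 + [1] * 100)

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[1.0, 1.0]]))     # [P(Y=0|x), P(Y=1|x)]
```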
In contrast, a generative model attempts to learn the joint distribution $P(X, Y)$ of the input data and the labels.
We can rewrite Bayes' theorem as follows:

$$P(Y|X) = \frac{P(X, Y)}{P(X)}$$

Instead of learning a direct mapping of $X$ to $Y$ using $P(Y|X)$, as in the discriminative case, the goal is to model the joint probability of $X$ and $Y$, $P(X, Y)$. We can use the resulting joint distribution to compute the posterior $P(Y|X)$, but we can also sample new instances of the data, either by drawing new pairs $(x, y)$ jointly or by drawing new inputs for a chosen label via $P(X|Y=y) = P(X, Y)/P(Y)$.
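The following toy sketch illustrates this recipe in one dimension: it fits a class-conditional Gaussian $P(X|Y)$ and a class prior $P(Y)$ on synthetic data, then generates a new sample by drawing $y$ first and then $x$ given $y$. The Gaussian form and all numbers here are illustrative assumptions, not part of the original discussion:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy generative model: one Gaussian per class for P(X|Y), plus the class
# prior P(Y); together these define the joint P(X, Y) = P(X|Y) P(Y).
X = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])
y = np.array([0] * 100 + [1] * 100)

params = {c: (X[y == c].mean(), X[y == c].std()) for c in (0, 1)}
prior = {c: np.mean(y == c) for c in (0, 1)}

# Sampling new data: draw y ~ P(Y), then x ~ P(X|Y=y).
new_y = rng.choice([0, 1], p=[prior[0], prior[1]])
mu, sigma = params[new_y]
new_x = rng.normal(mu, sigma)
print(f"sampled (x, y) = ({new_x:.2f}, {new_y})")
```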
Types of generative models
Examples of generative models include the following:
Naive Bayes classifiers
Latent Dirichlet Allocation (LDA)
Hidden Markov models (HMMs)
Deep Boltzmann machines (DBMs)
Variational autoencoders (VAEs)
Generative adversarial networks (GANs)
Naive Bayes classifiers, though usually deployed for the discriminative task of classification, utilize Bayes' theorem to learn the joint distribution of $X$ and $Y$ under the assumption that the features of $X$ are independent of one another given the class.
LDA represents a document as the joint probability of words and a set of underlying keyword lists (topics) used in the document. Hidden Markov models express the joint probability of the current and next state of a sequence, such as the weather on successive days of the week. The VAE and GAN models also utilize joint distributions to map between complex data types. This mapping allows us to generate data from random vectors or transform one kind of data into another.
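To make the naive Bayes case concrete, here is a minimal sketch of how the joint distribution factorizes under the independence assumption; the probability tables are invented for illustration:

```python
# Under the naive Bayes assumption, the joint distribution factorizes as
# P(x1, x2, y) = P(y) * P(x1|y) * P(x2|y).
# Hypothetical estimated tables for two binary features:
p_y = {0: 0.7, 1: 0.3}
p_x1_given_y = {0: 0.2, 1: 0.8}  # P(x1=1 | y)
p_x2_given_y = {0: 0.4, 1: 0.9}  # P(x2=1 | y)

def joint(x1, x2, y):
    """Joint probability of a full (x1, x2, y) configuration."""
    px1 = p_x1_given_y[y] if x1 == 1 else 1 - p_x1_given_y[y]
    px2 = p_x2_given_y[y] if x2 == 1 else 1 - p_x2_given_y[y]
    return p_y[y] * px1 * px2

# The joint sums to 1 over all configurations (rule 2 again).
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(total)  # 1.0 (up to floating-point rounding)
```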
As already mentioned, another view of generative models is that they allow us to generate samples of $X$ if we know an outcome $Y$.