Rules of Probability

Explore the foundational rules of probability crucial for generative AI. Understand how probability values are assigned, the difference between dependent and independent data, and how conditional and joint probabilities inform model design. Discover how Bayes' theorem links these concepts and the distinction between discriminative and generative models, with examples including neural networks like VAEs and GANs.

At the simplest level, a model, be it for machine learning or a more classical method such as linear regression, is a mathematical description of how various kinds of data relate to one another.

In the task of modeling, we usually think about separating the variables of our dataset into two broad classes:

  1. Independent data: These are the inputs to a model, denoted by X. They could be categorical features (such as a 0 or 1 in six columns indicating which of six schools a student attends), continuous (such as the heights or test scores of the same students), or ordinal (the rank of a student in the class).

  2. Dependent data: These are the outputs of our models, denoted by Y. As with the independent variables, these can be continuous, categorical, or ordinal, and they can be an individual element or a multidimensional matrix (tensor) for each element of the dataset.

In some cases, Y is a label that can be used to condition a generative output, such as in a conditional GAN.

So, how can we describe the data in our model using statistics? In other words, how can we quantitatively describe what values we are likely to see, how frequently, and which values are more likely to appear together? One way is by asking the likelihood of observing a particular value in the data, or the probability of that value. For example, if we were to ask what the probability is of observing a roll of 4 on a six-sided die, the answer is that, on average, we would observe a 4 once every six rolls. We write this as follows:

P(X = 4) = 1/6

where P denotes the probability of an event.
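The die example can also be checked empirically: if we simulate many rolls, the observed frequency of a 4 converges toward 1/6. A minimal sketch in plain Python (the seed and roll count are arbitrary choices):

```python
import random

random.seed(0)

# Simulate many rolls of a fair six-sided die and count how often a 4 appears.
n_rolls = 100_000
rolls = [random.randint(1, 6) for _ in range(n_rolls)]
p_four = rolls.count(4) / n_rolls

print(f"P(roll = 4) ~ {p_four:.3f}")  # close to 1/6 ~ 0.167
```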

What defines the allowed probability values for a particular dataset? If we imagine the set of all possible values of a dataset, such as all values of a die, then a probability maps each value to a number between 0 and 1. The minimum is 0 because we can’t have a negative chance of seeing a result; the most unlikely result is that we would never see a particular value, or 0% probability, such as rolling a 7 on a six-sided die. Similarly, we can’t have greater than 100% probability of observing a result, represented by the value 1; an outcome with probability 1 is absolutely certain.

This set of probability values associated with a dataset may belong to discrete classes (such as the faces of a die) or an infinite set of potential values (such as variations in height or weight). In either case, however, these values have to follow certain rules, the probability axioms (University of York, 2019. https://www.york.ac.uk/):

  1. The probability of an observation (a die roll, a particular height, and so on) is a non-negative, finite number between 0 and 1.

  2. The probability of at least one of the observations in the space of all possible observations occurring is 1.

  3. The probability that any one of a set of distinct, mutually exclusive events occurs is the sum of the probabilities of the individual events.

While these rules might seem abstract, we’ll later see that they have direct relevance to the development of neural network models. For example, an application of rule 1 is to generate a probability between 0 and 1 for a particular outcome in a softmax function—a mathematical function that converts a vector of real numbers into a probability distribution—for predicting target classes. Rule 3 is used to normalize these outcomes into the range 0–1, under the guarantee that they are mutually distinct predictions of a deep neural network (in other words, a real-world image logically can’t be classified as both a dog and a cat, but rather a dog or a cat, with the probability of these two outcomes additive). Finally, rule 2 provides the theoretical guarantees that we can generate data at all using these models.
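A minimal sketch of how a softmax layer enforces these axioms, in plain Python with no deep learning framework (the logits are illustrative values):

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability, then exponentiate
    # and normalize so the outputs form a probability distribution.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])

# Each output lies in [0, 1] (rule 1), and because the classes are
# mutually exclusive, the probabilities sum to exactly 1 (rules 2 and 3).
print(probs)
print(sum(probs))
```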

However, in the context of machine learning and modeling, we are not usually interested in just the probability of observing a piece of input data, X; we instead want to know the conditional probability of an outcome, Y, given the data, X. In other words, we want to know how likely a label is for a set of data based on that data. We write this as the probability of Y given X, or the probability of Y conditional on X:

P(Y|X)

Another question we could ask about Y and X is how likely they are to occur together, or their joint probability, which can be expressed using the preceding conditional probability expression as follows:

P(X, Y) = P(Y|X)P(X)

This formula expresses the probability of X and Y. In the case of X and Y being completely independent of one another, this is simply their product:

P(X, Y) = P(X)P(Y)
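These identities can be verified on a small discrete example. The sketch below defines a toy joint distribution over two binary variables (the probability values are made up for illustration), derives the marginals and conditionals, and checks the product rule, and it also shows that for these particular numbers X and Y are not independent:

```python
# A tiny joint distribution over X in {0, 1} and Y in {0, 1},
# stored as P(X = x, Y = y). Values are illustrative.
joint = {
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.20, (1, 1): 0.40,
}

# Marginals: P(X = x) and P(Y = y), by summing out the other variable.
p_x = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in (0, 1)}

# Conditional: P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)
p_y_given_x = {(y, x): joint[(x, y)] / p_x[x] for x in (0, 1) for y in (0, 1)}

# The product rule recovers the joint: P(X, Y) = P(Y|X) P(X)
for x in (0, 1):
    for y in (0, 1):
        assert abs(p_y_given_x[(y, x)] * p_x[x] - joint[(x, y)]) < 1e-12

# Here X and Y are *not* independent, so P(X)P(Y) differs from P(X, Y):
print(p_x[1] * p_y[1], joint[(1, 1)])  # 0.3 vs 0.4
```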

These expressions become important in our discussion of complementary priors later on, and the ability of restricted Boltzmann machines to simulate independent data samples. They are also important as building blocks of Bayes’ theorem, which we'll discuss next.

Discriminative and generative modeling and Bayes’ theorem

Now let’s consider how these rules of conditional and joint probability relate to the kinds of predictive models that we build for various machine learning applications. In most cases—such as predicting whether an email is fraudulent or the dollar amount of the future lifetime value of a customer—we are interested in the conditional probability, P(Y|X = x), where Y is the set of outcomes we are trying to model, X represents the input features, and x is a particular value of the input features. This approach is known as discriminative modeling (Jebara, 2004. https://www.springer.com/gp/book/9781402076473). Discriminative modeling attempts to learn a direct mapping between the data, X, and the outcomes, Y.

Another way to understand discriminative modeling is in the context of Bayes' theorem (Bayes, T., 1763. "An essay towards solving a problem in the doctrine of chances." Philosophical Transactions of the Royal Society of London, 53, 370–418. https://doi.org/10.1098/rstl.1763.0053), which relates the conditional and joint probabilities of a dataset:

P(Y|X) = P(X|Y)P(Y) / P(X)

In Bayes’ formula, the expression P(X|Y)/P(X) is known as the likelihood, or the supporting evidence that the observation X gives to the likelihood of observing Y; P(Y) is the prior, or the plausibility of the outcome; and P(Y|X) is the posterior, or the probability of the outcome given all the independent data we have observed related to the outcome thus far. Conceptually, Bayes’ theorem states that the probability of an outcome is the product of its baseline probability and the probability of the input data conditional on this outcome.
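A short numerical sketch of Bayes' theorem, using a hypothetical two-class example (the prior and likelihood values are invented for illustration):

```python
# Two outcome classes with priors P(Y), and the likelihood P(x | Y = y)
# of an observed feature value x under each class. All values are made up.
prior = {"spam": 0.2, "ham": 0.8}
likelihood = {"spam": 0.7, "ham": 0.1}

# Evidence: P(x) = sum over y of P(x | y) P(y)
evidence = sum(likelihood[y] * prior[y] for y in prior)

# Posterior: P(y | x) = P(x | y) P(y) / P(x)
posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}
print(posterior)
```

Note how the unlikely prior (spam is only 20% of messages) is overturned by strong evidence, exactly the product of baseline probability and conditional evidence described above.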

In the context of discriminative learning, we can see that a discriminative model directly computes the posterior; we could have a model of the likelihood or prior, but it is not required in this approach. Even though you may not have realized it, most of the models you have probably used in the machine learning toolkit are discriminative, such as the following:

  • Linear regression

  • Logistic regression

  • Random forests

  • Gradient-boosted decision trees (GBDT)

  • Support vector machines (SVM)

The first two (linear and logistic regression) model the outcome, Y, conditional on the data, X, using a normal or Gaussian (linear regression) or sigmoidal (logistic regression) probability function. In contrast, the last three have no formal probability model—they compute a function (an ensemble of trees for random forests or GBDT, or an inner product distribution for SVM) that maps X to Y, using a loss or error function to tune those estimates. Given this nonparametric nature, some authors have argued that these constitute a separate class of non-model discriminative algorithms (Jebara, Tony, 2004. Machine Learning: Discriminative and Generative. Kluwer Academic/Springer. https://www.springer.com/gp/book/9781402076473).
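To make the discriminative idea concrete, here is a sketch of an already-fitted logistic regression in plain Python. The weights and inputs are hypothetical; the point is that the model maps features x directly to the posterior P(Y = 1 | X = x) without ever modeling P(X) or P(X|Y):

```python
import math

def sigmoid(z):
    # Squashes any real number into (0, 1), satisfying the probability axioms.
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative, pre-fitted parameters (not learned here).
w = [1.5, -0.8]
b = 0.2

def predict_proba(x):
    # Direct mapping from features to the posterior P(Y = 1 | X = x).
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

print(predict_proba([2.0, 1.0]))  # a probability in (0, 1)
```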

In contrast, a generative model attempts to learn the joint distribution P(Y, X) of the labels and the input data. Recall the definition of joint probability:

P(X, Y) = P(X|Y)P(Y)

We can rewrite Bayes’ theorem as follows:

P(Y|X) = P(X, Y) / P(X)

Instead of learning a direct mapping of X to Y using P(Y|X), as in the discriminative case, our goal is to model the joint probabilities of X and Y using P(X, Y). While we can use the resulting joint distribution of X and Y to compute the posterior, P(Y|X), and learn a targeted model, we can also use this distribution to sample new instances of the data, either by jointly sampling new tuples (x, y), or by sampling new data inputs using a target label, Y, with the following expression:

P(X|Y = y)
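This two-step sampling scheme can be sketched with a toy generative model: draw a class from the prior P(Y), then draw a data point from a class-conditional distribution P(X|Y = y). The class-conditional Gaussians below are illustrative stand-ins, not learned parameters:

```python
import random

random.seed(42)

# A toy generative model of P(X, Y): a prior over labels plus a
# class-conditional Gaussian P(X | Y = y) for each label.
prior = {"cat": 0.5, "dog": 0.5}
cond = {"cat": (0.0, 1.0), "dog": (4.0, 1.0)}  # (mean, std) of P(X | Y)

def sample_joint():
    # Jointly sample a new tuple (x, y): first y ~ P(Y), then x ~ P(X | Y = y).
    y = random.choices(list(prior), weights=list(prior.values()))[0]
    mu, sigma = cond[y]
    return random.gauss(mu, sigma), y

def sample_x_given(y):
    # Conditioned generation: fix the label and sample x ~ P(X | Y = y).
    mu, sigma = cond[y]
    return random.gauss(mu, sigma)

pairs = [sample_joint() for _ in range(5)]
print(pairs)
print(sample_x_given("dog"))
```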

Types of generative models

Examples of generative models include the following:

Naive Bayes classifiers, though typically applied to discriminative classification tasks, utilize Bayes’ theorem to learn the joint distribution of X and Y under the assumption that the X variables are independent of one another given the class. Similarly, Gaussian mixture models describe the likelihood of a data point belonging to one of a group of normal distributions using the joint probability of the label and these distributions.
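A minimal sketch of how naive Bayes assembles the joint distribution from a prior and per-feature conditionals (the tiny dataset and labels are made up for illustration):

```python
from collections import Counter

# Toy dataset: each row is (x1, x2, y). Naive Bayes learns P(Y) and,
# assuming features are independent given the class, each P(x_i | Y);
# their product defines the joint P(X, Y).
data = [
    (1, 0, "spam"), (1, 1, "spam"), (1, 1, "spam"),
    (0, 0, "ham"), (0, 1, "ham"), (0, 0, "ham"), (1, 0, "ham"),
]

labels = [y for _, _, y in data]
prior = {y: c / len(data) for y, c in Counter(labels).items()}

def cond_prob(feature_idx, value, y):
    # Empirical estimate of P(x_feature = value | Y = y).
    rows = [r for r in data if r[2] == y]
    return sum(1 for r in rows if r[feature_idx] == value) / len(rows)

def joint(x1, x2, y):
    # P(X1 = x1, X2 = x2, Y = y) under the naive independence assumption.
    return prior[y] * cond_prob(0, x1, y) * cond_prob(1, x2, y)

print(joint(1, 1, "spam"), joint(1, 1, "ham"))
```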

Latent Dirichlet Allocation (LDA) represents a document as the joint probability of a word and a set of underlying keyword lists (topics) that are used in a document. Hidden Markov models express the joint probability of a state and the next state of the data, such as the weather on successive days of the week. The VAE and GAN models also utilize joint distributions to map between complex data types. This mapping allows us to generate data from random vectors or transform one kind of data into another.

As already mentioned, another view of generative models is that they allow us to generate samples of X if we know an outcome, Y. In the first four models in the previous list, this conditional probability is just a component of the model formula, with the posterior estimates still being the ultimate objective. However, in the last three examples, which are all deep neural network models, learning the conditional of X dependent upon a hidden, or latent, variable, Z, is actually the main objective, to generate new data samples. Using the rich structure allowed by multi-layered neural networks, these models can approximate the distribution of complex data types such as images, natural language, and sound. Also, instead of being a target value, Z is often a random number in these applications, serving merely as an input from which to generate a large space of hypothetical data points. To the extent that we have a label (such as whether a generated image should be of a dog or a dolphin, or the genre of a generated song), the model is P(X|Y = y, Z = z), where the label Y controls the generation of data that is otherwise unrestricted by the random nature of Z.