Changing Probabilities

Learn how changing probabilities works and how to use the technique in making custom generators.

Why is changing probabilities required?

To understand why and when changing probabilities is useful, let’s take a look at an example. Let’s imagine that we need a generator for ISO Latin-1 (latin1) strings, which means restricting the range of string().

With transforms, filters, and resizes, we can get pretty far in terms of retargeting our generators to do what we want. The latin1 generator shows us something interesting though. The default string() generator has a large search space, and therefore filtering out the unwanted data can be expensive. On the other hand, most Unicode characters can’t be represented within latin1, and transforming the generated strings themselves would also be expensive. For instance, how would we map emojis to latin1 characters?
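To get a rough feel for why filtering is costly here, the sketch below computes the odds that randomly drawn code points happen to fit in latin1. It assumes code points are drawn uniformly, which real generators do not do, so treat the numbers as an illustration rather than a measurement. The module name Latin1Odds and its functions are hypothetical and use plain Elixir only.

```elixir
defmodule Latin1Odds do
  # Unicode code points run from 0 to 0x10FFFF; latin1 covers 0 to 0xFF.
  @unicode_size 0x10FFFF + 1
  @latin1_size 0xFF + 1

  # Chance that a single uniformly drawn code point fits in latin1
  # (assumption: uniform draws, unlike a real generator).
  def per_char, do: @latin1_size / @unicode_size

  # Chance that every one of `len` uniformly drawn code points fits.
  def per_string(len), do: :math.pow(per_char(), len)

  # The predicate a filter would have to run on each candidate string.
  def latin1?(string) do
    string
    |> String.to_charlist()
    |> Enum.all?(&(&1 <= 0xFF))
  end
end

IO.inspect(Latin1Odds.per_char())
IO.inspect(Latin1Odds.per_string(5))
IO.inspect(Latin1Odds.latin1?("caf\u00E9"))
IO.inspect(Latin1Odds.latin1?("\u{1F600}"))
```

Under that uniform assumption, a single code point fits in latin1 once in 4352 draws, and the odds shrink with every extra character, so a filter would throw away nearly everything it is given.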

We can’t solve this problem efficiently with transforms and filters alone, and for this specific issue, resizing wouldn’t be of much help either. Instead, we’ll have to build our own generators while controlling probabilities to make them do what we want.

Changing probabilities

The last fundamental building block that really gives us control over data generation is the ability to tweak the probabilities with which data is generated. By default, the generators provided cover either a large space of potential values, like string(), number(), or binary(), or a rather narrow one, such as boolean() or range(X,Y).

Using let() allows us to transform all of the data, and such_that() allows us to remove some of it. But it’s difficult to achieve a middle ground between the two. When we truly need a custom solution, probabilistic generators can help.
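As a reminder of what those two tools look like, here is a minimal sketch. It assumes the PropCheck library is available as a dependency and uses its utf8() generator as the string generator; the module and function names are made up for illustration.

```elixir
defmodule NaiveGenerators do
  # Assumption: the :propcheck dependency is available in the project.
  use PropCheck

  # let: every generated integer is transformed into an even one.
  def even() do
    let n <- integer() do
      n * 2
    end
  end

  # such_that: strings with any code point above 255 are thrown away.
  # This is the filtering approach, and it may reject most candidates.
  def latin1_by_filter() do
    such_that s <- utf8(), when: latin1?(s)
  end

  defp latin1?(string) do
    string
    |> String.to_charlist()
    |> Enum.all?(&(&1 <= 0xFF))
  end
end
```

The first generator touches all of the data and the second discards part of it. Neither one lets us say "mostly this kind of value, occasionally that kind," which is the gap probabilistic generators fill.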

We already had a look at oneof(ListOfGenerators) in the Collecting lesson, where it helped us get keys that repeat more often with the following generator:

def key(), do: oneof([range(1, 10), integer()])

This shows how two distinct generators can be used together to help build and steer things in the direction we want. The oneof(Types) generator is simple and useful, but the most interesting generator is frequency(), which allows us to control and choose the probability of each generator it contains.
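To make the difference concrete, the sketch below puts the key() generator next to a weighted variant. It assumes PropCheck is available as a dependency. The weights 3 and 1, along with the names KeyGenerators and weighted_key, are arbitrary choices for illustration and do not come from the lesson.

```elixir
defmodule KeyGenerators do
  # Assumption: the :propcheck dependency is available in the project.
  use PropCheck

  # oneof/1 gives us no say over how often each generator is used.
  def key(), do: oneof([range(1, 10), integer()])

  # frequency/1 takes {weight, generator} pairs. With the hypothetical
  # weights below, range(1, 10) is chosen about 3 times out of 4 and
  # integer() about 1 time out of 4.
  def weighted_key() do
    frequency([
      {3, range(1, 10)},
      {1, integer()}
    ])
  end
end
```

Raising the first weight should make repeated keys more common, while lowering it shifts generation toward arbitrary integers.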

Let’s take strings as an example since they were already causing us problems. Just using string() tended to yield a lot of control characters, extremely variable codepoints, and very few latin1 or ASCII characters. Let’s look at how frequency() can be used to help us with our problem.
