Get to Know the Problem

Explore supervised learning by solving a real-life problem and plotting the data on a 2D graph.

The problem statement

Our friend owns a cozy little pizzeria in a busy metropolitan city. Every day at noon, they check the number of reserved seats and decide how much pizza dough to prepare for dinner. Too much dough, and it goes to waste; too little, and they run out of pizzas. In either case, the restaurant loses money.

It’s not always easy to gauge the number of pizzas from the reservations. Many customers don’t reserve a table, or they eat something other than pizza. The owner knows that there is some kind of link between those numbers, in that more reservations generally mean more pizzas, but other than that, it’s not clear what the exact relation is.

The restaurant owner wants a program that looks at historical data, grasps the relation between reserved seats and pizzas and uses it to forecast tonight’s pizza sales from today’s reservations. Can we code such a program for them?

Supervised pizza

Remember what we learned back in the Supervised Learning lesson? We can solve the pizza forecasting problem by training a supervised learning algorithm with a bunch of labeled examples. To get the examples, we ask the restaurant owner to jot down a few days’ worth of reservations and pizzas, and we collect that data in a file. Here’s what the first four lines of that file look like:

Reservations Pizzas
13 33
2 16
14 32
23 51
02_first/pizza.txt

The file contains 30 lines of data. Each line is an example, composed of an input variable (the reservations) and a numerical label (the pizzas). Once we have an algorithm, we can use these examples to train it. Later on, during the prediction phase, we can pass a specific number of reservations to the algorithm and ask it to come up with a matching number of pizzas.

Let’s start with the numbers as a data scientist would.

Make sense of the data

If we glance at the pizza examples, it seems that the reservations and pizzas are correlated.

The NumPy library has a convenient function to import whitespace-separated data from text:

import numpy as np
X, Y = np.loadtxt("pizza.txt", skiprows=1, unpack=True)

The first line imports the NumPy library, and the second uses NumPy’s loadtxt() function to load the data from the pizza.txt file. The skiprows=1 argument skips the header row, and unpack=True “unpacks” the two columns into separate arrays called X and Y. X contains the values of the input variable, and Y contains the labels. We use uppercase names for X and Y, because that’s a common Python convention to indicate that a variable should be treated as a constant.
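To see what skiprows and unpack do without needing pizza.txt on disk, here is a minimal sketch that feeds loadtxt() the header and the four sample rows shown earlier from an in-memory buffer. The names sample and table, and the use of io.StringIO, are assumptions made for illustration only; the lesson itself reads from the file.

```python
import io
import numpy as np

# The header and the first four data rows from the lesson, held in memory
# so that the sketch runs without pizza.txt (an assumption for illustration).
sample = io.StringIO(
    "Reservations Pizzas\n"
    "13 33\n"
    "2 16\n"
    "14 32\n"
    "23 51\n"
)

# Without unpack, loadtxt() returns one 2D array: a row per example,
# a column per variable. skiprows=1 drops the header row.
table = np.loadtxt(sample, skiprows=1)
print(table.shape)

# Rewind the buffer and load again, this time with unpack=True,
# which returns each column as its own 1D array.
sample.seek(0)
X, Y = np.loadtxt(sample, skiprows=1, unpack=True)
print(X)  # reservations
print(Y)  # pizzas
```

Note that loadtxt() returns floating-point values by default, which is why the numbers print with a trailing dot.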

Let’s peek at the data to make sure it loaded okay. To follow along, run the two lines given earlier in a Python interpreter, and then check out the first few elements of X and Y:

X[0:5]
[ 13. 2. 14. 23. 13.]
Y[0:5]
[ 33. 16. 32. 51. 27.]

The numbers are consistent with the restaurant owner’s file, but it’s still hard to make sense of them. Let’s plot them on a chart for clarity:

Now the correlation jumps out at us: the more reservations, the more pizzas. To be fair, a statistician might scold us for drawing conclusions from a handful of hastily collected examples. However, this is no research project, so let’s ignore the little statistician on our shoulder and build a pizza forecaster in the next lesson.
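If we want a number to go with the visual impression, one option is the Pearson correlation coefficient. The following is only a rough sketch: it uses the five example values printed earlier rather than the full file, and the variable name r is an assumption for illustration. With so few points, the result is an indication rather than evidence, which is exactly what the statistician would remind us of.

```python
import numpy as np

# The first five examples shown earlier in the lesson.
X = np.array([13, 2, 14, 23, 13], dtype=float)  # reservations
Y = np.array([33, 16, 32, 51, 27], dtype=float)  # pizzas

# np.corrcoef() returns a 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between X and Y.
r = np.corrcoef(X, Y)[0, 1]
print(f"Pearson correlation: {r:.2f}")
```

A value close to 1 means that pizzas tend to rise in step with reservations; a value near 0 would mean no linear relation.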

Note: If we’re wondering how to plot this graph, we can experiment with the following plotting code.

# Plot the reservations/pizzas dataset.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sea
sea.set()
plt.axis([0, 50, 0, 50]) # scale axes (0 to 50)
plt.xticks(fontsize=14) # set x axis ticks
plt.yticks(fontsize=14) # set y axis ticks
plt.xlabel("Reservations", fontsize=14) # set x axis label
plt.ylabel("Pizzas", fontsize=14) # set y axis label
X, Y = np.loadtxt("pizza.txt", skiprows=1, unpack=True) # load data
plt.plot(X, Y, "bo") # plot data
plt.show() # display chart