Let’s generate training data for the search ranking ML model. Note that the terms training data row and training example will be used interchangeably.

Training data generation for pointwise approach

📝 Pointwise approach: In this approach to model training, the training data consists of a relevance score for each document. The loss function looks at the score of one document at a time as an absolute measure of relevance. Hence, the model is trained to predict the relevance of each document for a query individually. The final ranking is achieved by simply sorting the result list by these document scores.

While adopting the pointwise approach, our ranking model can make use of classification algorithms when the score of each document takes a small, finite number of values. For instance, if we aim to classify a document simply as relevant or irrelevant, the relevance score will be 0 or 1. This allows us to approximate the ranking problem with a binary classification problem.
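To make the approximation concrete, here is a minimal sketch (not the course’s implementation) of pointwise ranking with a binary classifier. The feature values, document names, and the choice of scikit-learn’s logistic regression are illustrative assumptions:

```python
# Minimal sketch: pointwise ranking via a binary classifier (illustrative only).
# Each training row is a (query, document) feature vector with a 0/1 relevance label.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features, e.g., [text-match score, document popularity]
X_train = np.array([
    [0.9, 0.7],   # relevant document
    [0.8, 0.2],   # relevant document
    [0.2, 0.9],   # irrelevant document
    [0.1, 0.3],   # irrelevant document
])
y_train = np.array([1, 1, 0, 0])  # binary relevance labels

model = LogisticRegression()
model.fit(X_train, y_train)

# Score each candidate document for a new query independently (pointwise),
# then obtain the final ranking by sorting on the predicted relevance probability.
candidates = {"doc_a": [0.85, 0.6], "doc_b": [0.3, 0.4], "doc_c": [0.7, 0.8]}
scores = {doc: model.predict_proba(np.array([feats]))[0, 1]
          for doc, feats in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # documents ordered from most to least likely to be relevant
```

Because each document is scored in isolation, the same trained model can rank a result list of any length.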

Now let’s generate training data for the binary classification approximation.

Positive and negative training examples

We are essentially predicting user engagement with a document in response to a query. A relevant document is one that successfully engages the searcher.

For instance, we have the searcher’s query: “Paris tourism”, and the following results are displayed on the SERP in response:

  1. Paris.com
  2. Eiffeltower.com
  3. Louvremuseum.com

We are going to label our data as positive/negative or relevant/irrelevant, keeping in mind the successful session rate metric, as shown in the following illustration.

Assumption

Let’s assume that the searcher did not engage with Paris.com but engaged with Eiffeltower.com. Upon clicking on Eiffeltower.com, they spent two minutes on the website and then signed up. After signing up, they went back to the SERP and clicked on Louvremuseum.com, where they spent only twenty seconds.
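To make this concrete, here is a minimal sketch of how such a session could be turned into labeled training rows. The 30-second dwell-time threshold and the use of sign-up as an engagement signal are assumptions for illustration, not rules stated above:

```python
# Minimal sketch: labeling training rows from the session described above.
# DWELL_THRESHOLD_SECONDS and the sign-up criterion are illustrative assumptions.
DWELL_THRESHOLD_SECONDS = 30

session = [
    {"query": "Paris tourism", "document": "Paris.com",
     "clicked": False, "dwell_seconds": 0, "signed_up": False},
    {"query": "Paris tourism", "document": "Eiffeltower.com",
     "clicked": True, "dwell_seconds": 120, "signed_up": True},
    {"query": "Paris tourism", "document": "Louvremuseum.com",
     "clicked": True, "dwell_seconds": 20, "signed_up": False},
]

def label(row):
    """Label a (query, document) pair 1 (relevant) if the interaction counts as
    successful engagement: a sign-up or a sufficiently long dwell time."""
    engaged = row["clicked"] and (
        row["signed_up"] or row["dwell_seconds"] >= DWELL_THRESHOLD_SECONDS
    )
    return 1 if engaged else 0

training_rows = [(row["query"], row["document"], label(row)) for row in session]
print(training_rows)
# [('Paris tourism', 'Paris.com', 0),
#  ('Paris tourism', 'Eiffeltower.com', 1),
#  ('Paris tourism', 'Louvremuseum.com', 0)]
```

Under these assumed criteria, Eiffeltower.com becomes a positive (relevant) example, while Paris.com and Louvremuseum.com become negative (irrelevant) examples for the query “Paris tourism”.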
