Training Data Generation
Let’s generate training data for the search ranking ML model. Note that the terms “training data row” and “training example” are used interchangeably.
Training data generation for pointwise approach
📝 Pointwise approach: In this approach to model training, the training data consists of a relevance score for each document. The loss function considers one document’s score at a time and treats it as an absolute measure of relevance. Hence, the model is trained to predict the relevance of each document for a query individually. The final ranking is obtained by simply sorting the result list by these document scores.
When adopting the pointwise approach, our ranking model can make use of classification algorithms if the score of each document takes a small, finite number of values. For instance, if we aim to simply classify a document as relevant or irrelevant, the relevance score will be 0 or 1. This allows us to approximate the ranking problem as a binary classification problem.
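To make the pointwise idea concrete, here is a minimal sketch in Python. It assumes a stand-in scoring function: the feature names (`text_match`, `historical_ctr`), the weights, and the bias are hypothetical values chosen for illustration, not the output of a trained model. The sketch shows only the mechanics: each document is scored on its own, and the ranking falls out of a sort.

```python
import math

# Hypothetical, hand-picked weights standing in for a trained binary classifier.
WEIGHTS = {"text_match": 2.0, "historical_ctr": 3.0}
BIAS = -2.5


def relevance_probability(features):
    """Score one (query, document) pair in isolation: estimated P(relevant)."""
    z = BIAS + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))  # logistic function maps z into (0, 1)


def rank_pointwise(candidates):
    """Sort documents by their individually predicted scores, highest first."""
    scored = [(doc, relevance_probability(feats)) for doc, feats in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative candidate documents for a single query (made-up feature values).
candidates = {
    "doc_a": {"text_match": 0.9, "historical_ctr": 0.10},
    "doc_b": {"text_match": 0.6, "historical_ctr": 0.60},
    "doc_c": {"text_match": 0.2, "historical_ctr": 0.05},
}

ranking = rank_pointwise(candidates)
for doc, score in ranking:
    print(f"{doc}: {score:.3f}")
```

Note that no document's score depends on any other document in the list. That independence is what makes the approach pointwise.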
Now let’s generate training data for the binary classification approximation.
Positive and negative training examples
We are essentially predicting user engagement with a document in response to a query. A relevant document is one that successfully engages the searcher.
For instance, suppose the searcher’s query is “Paris tourism” and the SERP displays the following results in response: Paris.com, Eiffeltower.com, and Lourvemusuem.com.
We are going to label each training example as positive/negative (relevant/irrelevant), keeping the successful session rate metric in mind, as shown in the following illustration.
Let’s assume that the searcher did not engage with Paris.com but engaged with Eiffeltower.com. Upon clicking on Eiffeltower.com, they spent two minutes on the website and then signed up. After signing up, they went back to the SERP and clicked on Lourvemusuem.com and spent twenty seconds there.
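The session above can be turned into training rows with a short sketch like the one below. The labeling rule is an assumption made for illustration: a result is labeled positive only if the searcher clicked it and then either converted (signed up) or stayed for at least `MIN_DWELL_SECONDS`. The 30-second threshold, the `Impression` class, and the `label` function are all hypothetical names and values, not something the text above specifies.

```python
from dataclasses import dataclass

# Assumption: a click counts as successful engagement only if the dwell time
# reaches this threshold or the searcher converts. The value is hypothetical.
MIN_DWELL_SECONDS = 30


@dataclass
class Impression:
    """One result shown on the SERP, plus what the searcher did with it."""
    query: str
    url: str
    clicked: bool
    dwell_seconds: float = 0.0
    signed_up: bool = False


def label(impression):
    """Return 1 (positive/relevant) or 0 (negative/irrelevant)."""
    if not impression.clicked:
        return 0
    if impression.signed_up or impression.dwell_seconds >= MIN_DWELL_SECONDS:
        return 1
    return 0


# The session described above, encoded as three impressions.
session = [
    Impression("Paris tourism", "Paris.com", clicked=False),
    Impression("Paris tourism", "Eiffeltower.com", clicked=True,
               dwell_seconds=120, signed_up=True),
    Impression("Paris tourism", "Lourvemusuem.com", clicked=True,
               dwell_seconds=20),
]

# Each training row is (query, document, label).
training_rows = [(imp.query, imp.url, label(imp)) for imp in session]
for row in training_rows:
    print(row)
```

Under this assumed rule, only Eiffeltower.com yields a positive example. The twenty-second visit to Lourvemusuem.com falls below the threshold and is labeled negative, so a lower threshold would flip that label.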