Training Data Generation
Explore how to generate training data for user engagement prediction models in feed-based systems. Understand balancing positive and negative examples, the impact of sampling on model calibration, and effective train-test splitting based on time intervals to improve model performance in real scenarios.
Your user engagement prediction model’s performance will depend largely on the quality and quantity of the training data. So, let’s see how you can generate training data for your model.
📝 Note that the terms "training data row" and "training example" will be used interchangeably.
Training data generation through online user engagement
The users’ online engagement with Tweets can give us positive and negative training examples. For instance, if you are training a single model to predict user engagement, then all the Tweets that received user engagement would be labeled as positive training examples. Similarly, the Tweets that only have impressions would be labeled as negative training examples.
📝 Impression: If a Tweet is displayed on a user’s Twitter feed, it counts as an impression. The user does not have to read it or engage with it; scrolling past it also counts as an impression.
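As a minimal sketch of this labeling rule for a single engagement model, the snippet below turns an impression log into labeled rows. The `Impression` record, its field names, and the sample log are hypothetical and only illustrate the idea; a real feed system would read these from its own logging pipeline.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Impression:
    """One Tweet shown to one user (hypothetical record layout)."""
    user_id: str
    tweet_id: str
    # Actions the user took on the Tweet; empty means impression only.
    actions: frozenset = frozenset()

def label_engagement(impressions):
    """Return (user_id, tweet_id, label) rows: 1 if any engagement, else 0."""
    return [
        (imp.user_id, imp.tweet_id, 1 if imp.actions else 0)
        for imp in impressions
    ]

# Illustrative log: a liked Tweet, a Tweet scrolled past, a commented Tweet.
impression_log = [
    Impression("u1", "t1", frozenset({"like"})),
    Impression("u1", "t2"),
    Impression("u1", "t3", frozenset({"comment"})),
]

rows = label_engagement(impression_log)
for row in rows:
    print(row)
```

Here any action at all makes the row positive, so the liked and the commented Tweets get label 1, while the Tweet that only received an impression gets label 0.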
However, as you saw in the architectural components lesson, you can also train separate models, each predicting the probability of a different user action on a Tweet. The following illustration shows how the same user engagement (as above) can be used to generate training data for separate engagement prediction models.
When you generate data for the “Like” prediction model, all Tweets that the user has liked would be positive examples, and all the Tweets that they did not like would be negative examples.
📝 Note how a Tweet that the user only commented on is still a negative example for the “Like” prediction model.
Similarly, for the “Comment” prediction model, all Tweets that the user commented on would be positive examples, and all the ones they did not comment on would be negative examples.
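The per-action labeling described above can be sketched as follows. This is an illustrative example, not a prescribed implementation: the tuple layout of the log, the `ACTIONS` set, and the function name `label_per_action` are assumptions made for the sketch.

```python
# Assumed set of actions, one prediction model per action.
ACTIONS = ("like", "comment")

def label_per_action(impressions, actions=ACTIONS):
    """Build one labeled dataset per action from the same impression log.

    `impressions` is a list of (user_id, tweet_id, actions_taken) tuples,
    where `actions_taken` is a set of action names (hypothetical layout).
    """
    datasets = {action: [] for action in actions}
    for user_id, tweet_id, taken in impressions:
        for action in actions:
            # Positive only if the user performed this specific action.
            datasets[action].append((user_id, tweet_id, int(action in taken)))
    return datasets

# Same illustrative engagement: one like, one plain impression, one comment.
log = [
    ("u1", "t1", {"like"}),
    ("u1", "t2", set()),
    ("u1", "t3", {"comment"}),
]

per_action = label_per_action(log)
for action, action_rows in per_action.items():
    print(action, action_rows)
```

Every impression produces a row in every dataset, so the commented Tweet `t3` appears as a negative example for the “Like” model and as a positive example for the “Comment” model.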
Balancing positive and negative training examples
Models essentially learn behavior from the data we present them with. Therefore, it’s important for us to provide a good sample of both positive and negative examples to model these interactions between different actors in a system. In the ...