What is Thompson sampling?

Thompson sampling, also known as the Bayesian bandit algorithm, is a widely used technique in decision-making and reinforcement learning. This algorithm tackles the classic "multi-armed bandit" problem, where a gambler must decide which slot machine (or "arm") to pull to maximize their cumulative reward over time.

Multi-armed bandit problem

Imagine facing a row of slot machines. Each machine has a different and unknown probability of winning, representing its reward distribution. The objective is to identify the machine with the highest probability of winning and to maximize the total winnings by repeatedly pulling arms over time.

Thompson sampling approach

Thompson sampling takes a probabilistic approach to tackle the multi-armed bandit problem. At its core, the algorithm maintains a probability distribution for each arm, representing its underlying probability of winning. By maintaining probability distributions for each arm and continually updating them with new information, Thompson sampling efficiently explores and exploits different options, making it a powerful technique for smarter decision-making under uncertainty.

Thompson sampling in python

main.py

Data.csv

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
items = pd.read_csv('Data.csv')
import random
total_rounds = 10000
ads = 10
selection = []
track_1 = [0] * ads
track_0 = [0] * ads
final = 0
for round in range(0, total_rounds):
    chosen = 0
    probability = 0
    for index in range(0, ads):
        score = random.betavariate(track_1[index] + 1, track_0[index] + 1)
        if score > probability:
            probability = score
            chosen = index
    selection.append(chosen)
    received = items.values[round, chosen]
    if received == 1:
        track_1[chosen] = track_1[chosen] + 1
    else:
        track_0[chosen] = track_0[chosen] + 1
    final = final + received
plt.hist(selection)
plt.title('Frequency of Ad selections using Thompson sampling')
plt.xlabel('Ads')
plt.ylabel('Number of selections')
plt.savefig('output/image.png')
plt.show()

Code explanation

Line 1–3: Importing the necessary libraries. We use numpy for numerical computations, matplotlib.pyplot for creating plots, and pandas for handling data from the CSV file.
Line 5: Load the data from the CSV file Data.csv into a pandas DataFrame called items.
Line 8: Initializing total_rounds to specify the number of rounds the algorithm will run.
Line 9: Initializing ads to represent the number of arms (ads) to choose from.
Line 10: Initializing selection to keep track of selected arms.
Line 11–12: Initializing track_1 and track_0 to store the number of times each arm received a reward or no reward.
Line 13: Initializing final to accumulate the total rewards.
Line 14–22: Perform Thompson sampling to select arms in each round.
Line 23–28: Update reward tracking based on feedback.
Line 30–35: Plot the histogram of selected arms using Thompson sampling.

Output

The x-axis of the histogram represents the different arms (ads) available for selection, and the y-axis represents the number of times each arm (ad) was selected during the total number of rounds (specified as total_rounds in the code). The higher the count for a particular arm, the more frequently it was chosen by the Thompson sampling algorithm.

Applications

Thompson sampling has applications in various domains, including:

Online advertising: Optimizing the allocation of resources to different ad campaigns to maximize returns.
Clinical trials: Efficiently testing treatments or medications to identify the most effective.
Recommendation systems: Personalizing recommendations based on user interactions to improve engagement.
Resource allocation: Allocating resources like server capacity or inventory to maximize overall utility.

Conclusion

Thompson sampling solves the multi-armed bandit problem efficiently. By employing Bayesian statistics and probability distributions, Thompson sampling balances between exploring new options and exploiting well-known ones.

Free Resources