How to predict website traffic using Python

Website traffic forecasting predicts the likely traffic on a website from its past traffic records. With a forecast in hand, a website can plan ahead so that its traffic stays within the available bandwidth. A forecast also helps allocate resources and personnel to deal with traffic-related issues.

Traffic can also be used as a metric to gauge a website’s reach and to decide on the next step for improving it. In this Answer, we’ll discuss how a statistical time series model, SARIMAX, can be used for website traffic forecasting.

Defining the dataset

The dataset used to forecast traffic contains the history of views on a sample website over a fixed period of time. It has two columns: Date_traffic holds the date, and Views_per_day holds the total views on the website on that date. Dates in the dataset are written in d/m/Y format. The training data is stored in Traffic_record.csv.
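
To make this layout concrete, the following is a minimal sketch that writes a small synthetic file with the same two columns. The helper name write_sample_traffic, the start date, and the generated view counts are illustrative assumptions and do not come from the real Traffic_record.csv.

```python
import csv
import math
from datetime import date, timedelta


def write_sample_traffic(path, start=date(2021, 6, 1), days=60):
    """Write a synthetic file with Date_traffic and Views_per_day columns."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Date_traffic", "Views_per_day"])
        for offset in range(days):
            day = start + timedelta(days=offset)
            # Made-up numbers: a slow upward trend plus a 7-day wave.
            wave = round(200 * math.sin(2 * math.pi * offset / 7))
            views = 1000 + 5 * offset + wave
            # Dates are written day first, matching the d/m/Y layout.
            writer.writerow([day.strftime("%d/%m/%Y"), views])
```

Calling write_sample_traffic("sample_traffic.csv") produces a file that can stand in for the real dataset while experimenting with the steps that follow.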

Website traffic forecasting process

The website traffic forecasting process is a series of steps. We install and import the libraries, prepare the dataset, and plot it to choose the parameter values used in model training. We then train the model, predict the traffic, and plot the results.

  1. Installing dependencies: To perform the website forecasting process, certain dependencies are required. In Python 3, we use pip3 to install them. For this process, we require matplotlib, pandas, and statsmodels. To install the dependencies, we use the following commands:

pip3 install matplotlib
pip3 install pandas
python3 -m pip install statsmodels
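
After installing, it can help to confirm that the packages are visible to the interpreter. The sketch below uses only the standard library; the helper name installed_versions is an assumption made for illustration.

```python
from importlib import metadata


def installed_versions(packages=("matplotlib", "pandas", "statsmodels")):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```
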
  2. Importing libraries: The next step is to import the libraries into the Python file so that we can use them in the code. To import the libraries, we add the following statements to the code:

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_pacf
import statsmodels.api as sm

In the code:

  • Line 1: We import pandas for loading the dataset in the Python file.

  • Line 2: We import matplotlib for plotting the dataset.

  • Lines 3–5: We import seasonal_decompose and plot_pacf from statsmodels, along with the statsmodels API that we use to train the SARIMAX model. The plotting functions help us choose the values of p, d, and q for the model.

  3. Preparing the data: Before training the model, we read the dataset and convert the Date_traffic column from strings to datetime values, telling pandas that the strings are written in d/m/Y format. To do this, we add the following lines of code to the Python file:

traffic_history = pd.read_csv("Traffic_record.csv")
print(traffic_history.head())
# Format the date
traffic_history["Date_traffic"] = pd.to_datetime(traffic_history["Date_traffic"], format="%d/%m/%Y")
traffic_history.info()

In the code:

  • Line 1: We read the dataset from the CSV file, Traffic_record.csv, and load it into the data frame traffic_history.

  • Line 2: We print the first five rows of the data frame.

  • Line 4: We use pandas to convert the Date_traffic column from strings to datetime values, parsing them with the format %d/%m/%Y.

  • Line 5: We print a summary of the columns, including their data types. The info() method prints the summary itself, so it does not need to be wrapped in print().
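
Passing the format explicitly matters because a string such as 03/04/2021 is ambiguous. The format string given to pd.to_datetime uses the same directives as the standard library's datetime.strptime, so the difference can be sketched without pandas. The sample date below is made up for illustration.

```python
from datetime import datetime

raw = "03/04/2021"

# Day first, as in the dataset: 3 April 2021.
day_first = datetime.strptime(raw, "%d/%m/%Y")

# Month first, a common wrong guess: 4 March 2021.
month_first = datetime.strptime(raw, "%m/%d/%Y")

print(day_first.date())    # 2021-04-03
print(month_first.date())  # 2021-03-04
```
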

  4. Predicting p, d, and q: We use three different plotting techniques to find the parameter values for training the SARIMAX model. SARIMAX is a statistical model that captures seasonal patterns in the data, repeating with a seasonal period s, to predict future values. To determine the values of p, d, and q, we employ the following mechanisms:

Seasonal decomposition

Website traffic usually follows a repeating pattern rather than staying constant, so it is seasonal. For instance, educational websites get more traffic on weekdays than on weekends, and the opposite holds for entertainment websites. For seasonal traffic, we use the SARIMAX model and set the value of d to 1. To plot the graph that shows whether the data is seasonal or stationary, we use the following lines of code:

seasonal_traffic = seasonal_decompose(traffic_history["Views_per_day"], model="multiplicative", period=30)
figure_seasonal = seasonal_traffic.plot()
figure_seasonal.set_size_inches(10, 10)

  In the code:

  • Line 1: We apply seasonal_decompose to the Views_per_day column with a period of 30 days, using the multiplicative model. In time series decomposition, the multiplicative model represents the components as multiplied together.

  • Lines 2–3: We plot the decomposition and set the size of the figure. The plot() method returns the figure it draws on, so we don't need to create an empty figure with plt.figure() first.
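
To see what "multiplied together" means, here is a simplified standard-library sketch of the idea: estimate the trend with a centered moving average, then average the ratio of observed value to trend for each position in the cycle. It illustrates the concept only and is not the statsmodels implementation; the function names and the weekly sample data are assumptions.

```python
def moving_average_trend(values, period):
    """Centered moving average; None where the window does not fit."""
    if period % 2 == 0:
        raise ValueError("this sketch only handles odd periods")
    half = period // 2
    trend = [None] * len(values)
    for i in range(half, len(values) - half):
        window = values[i - half : i + half + 1]
        trend[i] = sum(window) / period
    return trend


def seasonal_indices(values, period):
    """Average ratio of observed value to trend for each cycle position."""
    trend = moving_average_trend(values, period)
    ratios = [[] for _ in range(period)]
    for i, (value, level) in enumerate(zip(values, trend)):
        if level is not None:
            ratios[i % period].append(value / level)
    indices = [sum(group) / len(group) for group in ratios]
    # Rescale so the indices average to 1, as a multiplicative factor should.
    mean_index = sum(indices) / period
    return [index / mean_index for index in indices]


# Made-up series: a flat level of 100 multiplied by a weekly pattern.
weekly_pattern = [1.2, 1.1, 1.0, 1.0, 0.9, 0.9, 0.9]
series = [100 * weekly_pattern[i % 7] for i in range(28)]
print([round(index, 2) for index in seasonal_indices(series, period=7)])
```

An index above 1 marks a position in the cycle where traffic runs above the trend, and an index below 1 marks one where it runs below.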

Autocorrelation

We use autocorrelation on the Views_per_day column to choose the value of p. To do that, we use the following line of code:

pd.plotting.autocorrelation_plot(traffic_history["Views_per_day"])

The output graph is as follows:

Autocorrelation graph

Based on the output, the curve moves after the fifth horizontal line, so we set the value of p to 5.
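
The quantity drawn on the autocorrelation plot can be computed by hand. The sketch below shows the sample autocorrelation at a single lag using only the standard library; the function name and the weekly sample series are made up for illustration.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values)
    covariance = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


# Made-up series that repeats every 7 days.
series = [10, 12, 15, 14, 9, 5, 4] * 8
for lag in (1, 7):
    print(lag, round(autocorrelation(series, lag), 3))
```

For a series with a weekly cycle, the value at lag 7 is higher than at lag 1, which is how a repeating pattern shows up as peaks in the plot.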

Partial autocorrelation

Now, to find the value of q, the moving average order, we use the partial autocorrelation of the Views_per_day column. To do that, we use the following line of code:

plot_pacf(traffic_history["Views_per_day"], lags = 100)

The output of the graph is as follows:

Partial autocorrelation graph

Based on the output, only two points lie far away from all the other points plotted in the graph, so we set the value of q to 2.
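
Partial autocorrelation measures the correlation at a lag after the effect of the shorter lags has been removed. One standard way to obtain it from the autocorrelations is the Durbin-Levinson recursion, sketched below with the standard library only. This is an illustration of the concept, not necessarily the method plot_pacf uses internally, and the function name and sample values are assumptions.

```python
def partial_autocorrelations(acf):
    """Durbin-Levinson recursion; acf[k] is the autocorrelation at lag k."""
    pacf = [1.0]
    previous = []  # coefficients of the fitted model of order k - 1
    for k in range(1, len(acf)):
        numerator = acf[k] - sum(
            previous[j] * acf[k - 1 - j] for j in range(k - 1)
        )
        denominator = 1.0 - sum(
            previous[j] * acf[j + 1] for j in range(k - 1)
        )
        phi_kk = numerator / denominator
        # Update the lower-order coefficients, then append the new one.
        current = [
            previous[j] - phi_kk * previous[k - 2 - j] for j in range(k - 1)
        ]
        current.append(phi_kk)
        pacf.append(phi_kk)
        previous = current
    return pacf


# Theoretical autocorrelations of a process where each value depends
# only on the previous one with weight 0.6: acf[k] = 0.6 ** k.
acf = [0.6 ** k for k in range(6)]
print([round(value, 3) for value in partial_autocorrelations(acf)])
```

In this example, only lag 1 has a nonzero partial autocorrelation, because every longer-range correlation is explained by the chain of one-step dependencies.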

  5. Model training: After choosing the values necessary for model training, we can finally train our SARIMAX model on the Views_per_day column. To do this, we use the following lines of code:

p, d, q = 5, 1, 2
model_used = sm.tsa.statespace.SARIMAX(traffic_history['Views_per_day'], order=(p, d, q), seasonal_order=(p, d, q, 12))
model_used = model_used.fit()
print(model_used.summary())

In the code:

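  • Line 1: We define the values of p, d, and q.

  • Line 2: We define the SARIMAX model on the Views_per_day column. The order argument sets the non-seasonal p, d, and q terms, and seasonal_order reuses the same values for the seasonal terms and adds the seasonal period s. Here, s is 12, the usual choice for monthly data with a yearly cycle. Because this dataset is daily, s counts days, so a value that matches the cycle in the data (for example, 7 for a weekly pattern) may be worth trying.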

  • Line 3: We train the model.

  • Line 4: We print the summary of the trained model.
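
In order=(p, d, q), d is the number of times the series is differenced before the autoregressive and moving average terms are fitted. The sketch below shows what a single difference (d = 1) does to a trending series and how forecasts on the differenced scale map back to view counts. The function names and numbers are made up for illustration.

```python
def difference(values):
    """First difference: the change from each observation to the next."""
    return [values[i] - values[i - 1] for i in range(1, len(values))]


def integrate(first_value, differences):
    """Undo a first difference by cumulative summation from a start level."""
    levels = [first_value]
    for step in differences:
        levels.append(levels[-1] + step)
    return levels


views = [100, 104, 109, 115, 122]  # made-up daily views with a rising trend
changes = difference(views)
print(changes)                                # [4, 5, 6, 7]
print(integrate(views[0], changes) == views)  # True
```

The differenced series no longer carries the steadily rising level, which is the part of the data that d = 1 is meant to remove.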

  6. Predicting future traffic: Now we use the trained model to predict the views on the website for the next 30 days. To do this, we use the following lines of code:

predicted_month = model_used.predict(len(traffic_history), len(traffic_history) + 29)
print(predicted_month)

In the code:

  • Line 1: We use the trained model to predict the traffic on the website. The start and end positions are both included in the result, so we start at len(traffic_history), the first position after the history, and end 29 positions later to get exactly 30 days.

  • Line 2: We print the predicted output.
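
The model was trained on a column with a default integer index, so the predictions are labeled by position rather than by date. A minimal sketch for mapping a forecast horizon back to calendar dates follows; the helper name forecast_dates and the last observed date are hypothetical.

```python
from datetime import date, timedelta


def forecast_dates(last_observed, horizon):
    """Return the calendar dates for the `horizon` days after `last_observed`."""
    return [
        last_observed + timedelta(days=step) for step in range(1, horizon + 1)
    ]


# Hypothetical last date in the history; replace it with the real one.
dates = forecast_dates(date(2022, 6, 30), horizon=30)
print(dates[0], dates[-1], len(dates))
```
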

  7. Plotting the prediction with history: Finally, we plot our predictions together with the traffic history that was fed to the model. The x-axis represents the days, and the y-axis represents the traffic on the website. To do this, we use the following lines of code:

traffic_history["Views_per_day"].plot(legend=True, label="Traffic history", figsize=(10, 10))
predicted_month.plot(legend=True, label= "Future predictions")


Copyright ©2024 Educative, Inc. All rights reserved