Time Series Data

Practice implementing ARIMA, incorporating seasonality with SARIMA, and handling outliers using statistical techniques.

Time series data plays a vital role in monitoring sensor-driven environments like manufacturing. From forecasting machine behavior to catching unusual patterns, the ability to model trends and clean noisy signals is a core data science skill. Let’s get started.

Implement ARIMA for manufacturing sensor data

In manufacturing environments, sensors continuously monitor equipment conditions such as temperature, vibration, and pressure. To support predictive maintenance and identify anomalies, you’re asked to forecast future sensor readings using a classic statistical model: ARIMA (autoregressive integrated moving average). The implementation should predict future values and help detect trends.

This is a question frequently asked by industrial analytics companies like GE Digital, Siemens, and manufacturing-focused AI startups.

Can you show how you would implement an ARIMA model for manufacturing sensor data? The implement_arima() function starts at line 8.

Implement an ARIMA model to forecast future values of a manufacturing sensor time series.
Ensure the data is stationary before modeling.
Generate predictions and evaluate performance using MSE.

Python 3.10.4

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

def implement_arima(sensor_data, order=(1,1,1)):
    """
    Implement ARIMA model for manufacturing sensor data to predict future values
    
    Parameters:
    sensor_data (pd.Series): Time series of sensor readings
    order (tuple): ARIMA order (p,d,q)
    
    Returns:
    tuple: (model, predictions, mse)
    """
    #TODO - your implementation here
    # Ensure data is stationary
    # Difference data if not stationary
    # Fit ARIMA model
    # Make predictions
    # Calculate MSE for the overlapping period
    return fitted, predictions, mse

def initialize_arima_data():
    """Initialize sample manufacturing sensor data"""
    np.random.seed(42)
    dates = pd.date_range(start='2024-01-01', periods=1000, freq='H')
    base_signal = np.sin(np.linspace(0, 100, 1000)) * 10
    noise = np.random.normal(0, 1, 1000)
    sensor_data = pd.Series(base_signal + noise, index=dates)
    return sensor_data

# Example usage
sensor_data = initialize_arima_data()
model, predictions, mse = implement_arima(sensor_data)
print(f"ARIMA MSE: {mse:.2f}")

Sample answer

Here’s how you might structure your response:

Preprocess the sensor data
1. Start by verifying if the time series is stationary using the Augmented Dickey-Fuller (ADF) test.
2. If it’s not stationary, apply differencing iteratively until the test indicates stationarity.
3. Ensure missing values introduced by differencing are handled, e.g., by dropping NaNs.
Select ARIMA order parameters
1. Choose (p, d, q) manually or use domain knowledge/defaults for simplicity.
2. In interviews, be prepared to explain why d=1 is often a good starting point for non-stationary data.
Fit the model
1. Fit an ARIMA model to the original (not differenced) series, as the library handles differencing internally.
2. Catch fitting errors or convergence issues, and mention fallback strategies if the model doesn’t converge, e.g., simplifying the parameters.
Make and evaluate predictions
1. Predict a short future window, e.g., the next 10–20 steps.
2. Compare predictions against known values to compute performance metrics like MSE.
3. Mention how residuals or prediction intervals could also be used for model evaluation or anomaly detection.
Explainability and trade-offs
1. In interviews, highlight that ARIMA is interpretable and well-suited for short-term forecasting, but may struggle with long-range trends or complex seasonality.
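
The role of d in the steps above can be illustrated with a minimal sketch. It assumes a synthetic series (a linear trend plus noise, not the sensor data used in this lesson) and shows that one round of differencing, which is what d=1 asks the model to apply internally, removes the trend. The names trend_series and first_diff are illustrative.

```python
import numpy as np
import pandas as pd

# Assumption: a synthetic series with a linear trend, used only for illustration.
rng = np.random.default_rng(0)
t = np.arange(200)
trend_series = pd.Series(0.5 * t + rng.normal(0, 1, size=200))

# First differencing (what d=1 applies internally): y[t] - y[t-1].
first_diff = trend_series.diff().dropna()

# The level of the original series drifts upward, while the differenced
# series fluctuates around the slope of the trend (0.5) without drifting.
first_half_mean = first_diff.iloc[:100].mean()
second_half_mean = first_diff.iloc[100:].mean()
print(f"Original: first-half mean {trend_series.iloc[:100].mean():.1f}, "
      f"second-half mean {trend_series.iloc[100:].mean():.1f}")
print(f"Differenced: first-half mean {first_half_mean:.2f}, "
      f"second-half mean {second_half_mean:.2f}")
```

If the two halves of the differenced series still had clearly different means, that would be a hint that a single difference is not enough.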

Here’s the solution code:

Python 3.10.4

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

def implement_arima(sensor_data, order=(1,1,1)):
    """
    Implement ARIMA model for manufacturing sensor data to predict future values
    
    Parameters:
    sensor_data (pd.Series): Time series of sensor readings
    order (tuple): ARIMA order (p,d,q)
    
    Returns:
    tuple: (model, predictions, mse)
    """
    # Ensure data is stationary
    def check_stationarity(data):
        result = adfuller(data)
        return result[1] < 0.05
    
    # Difference data if not stationary
    data = sensor_data.copy()
    while not check_stationarity(data):
        data = data.diff().dropna()
    
    # Fit ARIMA model
    model = ARIMA(sensor_data, order=order)
    fitted = model.fit()
    
    # Make predictions
    predictions = fitted.predict(start=len(sensor_data)-10, end=len(sensor_data)+10)
    
    # Calculate MSE for the overlapping period
    mse = np.mean((predictions[:10] - sensor_data[-10:])**2)
    
    return fitted, predictions, mse

def initialize_arima_data():
    """Initialize sample manufacturing sensor data"""
    np.random.seed(42)
    dates = pd.date_range(start='2024-01-01', periods=1000, freq='H')
    base_signal = np.sin(np.linspace(0, 100, 1000)) * 10
    noise = np.random.normal(0, 1, 1000)
    sensor_data = pd.Series(base_signal + noise, index=dates)
    return sensor_data

# Example usage
sensor_data = initialize_arima_data()
model, predictions, mse = implement_arima(sensor_data)
print(f"ARIMA MSE: {mse:.2f}")

In the solution above:

Lines 18–20: Define a helper function check_stationarity using the Augmented Dickey-Fuller (ADF) test. It returns True if the p-value is below 0.05, indicating the time series is stationary.
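Lines 23–25: Copy the original sensor data and repeatedly apply differencing (.diff().dropna()) until the ADF test indicates stationarity. The differenced copy serves only as a check: the model on line 28 is fit to the original series, and the d term in order controls the differencing that ARIMA applies internally.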
Line 28: Fit the ARIMA model to the original (non-differenced) sensor data using the provided (p,d,q) order.
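Line 32: Predict values from 10 steps before the end of the data through index len(sensor_data)+10. The first 10 values are in-sample, one-step-ahead predictions for points the model was fit on; the remaining 11 are out-of-sample forecasts.
Line 35: Compute the mean squared error (MSE) for the last 10 observed time steps by comparing the in-sample predictions to the actual sensor values. This gives a quick check of fit on recent data, though it is likely to be more optimistic than an error measured on held-out data.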

Augment the model with seasonality

Manufacturing processes often show cyclical patterns due to factors like shift changes, daily temperature variations, or equipment warm-up/cooldown cycles. Can you augment your answer to Question 1 with seasonality? The function implement_sarima() starts at line 6.

Python 3.10.4

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import adfuller

def implement_sarima(sensor_data, order=(1,1,1), seasonal_order=(1,1,1,24)):
    """
    Implement Seasonal ARIMA model for manufacturing sensor data
    
    Parameters:
    sensor_data (pd.Series): Time series of sensor readings
    order (tuple): ARIMA order (p,d,q)
    seasonal_order (tuple): Seasonal order (P,D,Q,s)
    
    Returns:
    tuple: (model, predictions, mse)
    """
    #TODO - your implementation here
    # Fit SARIMA model
    # Make predictions
    # Calculate MSE
    return fitted, predictions, mse

def initialize_seasonal_data():
    """Initialize sample seasonal manufacturing sensor data"""
    np.random.seed(42)
    dates = pd.date_range(start='2024-01-01', periods=1000, freq='H')
    # Base signal with daily seasonality
    base_signal = np.sin(np.linspace(0, 100, 1000)) * 10
    daily_pattern = np.sin(np.linspace(0, 2*np.pi*41.67, 1000)) * 5  # 41.67 cycles for 1000 hours
    noise = np.random.normal(0, 1, 1000)
    sensor_data = pd.Series(base_signal + daily_pattern + noise, index=dates)
    return sensor_data

# Example usage
seasonal_data = initialize_seasonal_data()
seasonal_model, seasonal_predictions, seasonal_mse = implement_sarima(seasonal_data)
print(f"SARIMA MSE: {seasonal_mse:.2f}")

Sample answer

Here’s how you may structure your response:

Prepare seasonal time series
1. Ensure the dataset shows repeating patterns, e.g., daily cycles every 24 hours.
2. Mention that we can optionally visualize it to confirm seasonality before proceeding.
Define SARIMA parameters
1. Choose a seasonal order (P, D, Q, s).
  1. P = seasonal AR terms
  2. D = seasonal differencing (often 1)
  3. Q = seasonal MA terms
  4. s = season length (e.g., 24 for hourly data with daily cycles)
2. Combine this with the standard ARIMA order (p, d, q).
Fit the SARIMA model
1. Use the seasonal_order argument in your model implementation.
2. Be aware that fitting SARIMA may take longer—mention this trade-off.
Generate predictions
1. Forecast a reasonable window, keeping in mind that seasonality may introduce lag or delay in how the model responds.
Evaluate and compare
1. Use MSE for evaluation, but also highlight whether the model captures seasonal peaks and valleys accurately.
2. In interviews, explain how SARIMA improves over standard ARIMA for periodic data, and when it’s worth the added complexity.
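
The effect of the seasonal differencing term D with s = 24 can be sketched without fitting a model. Assuming a synthetic hourly series with an exact 24-hour cycle (illustrative data, not the series used in this lesson), subtracting the value from 24 hours earlier removes the daily pattern:

```python
import numpy as np
import pandas as pd

# Assumption: synthetic hourly data with an exact 24-hour cycle plus small noise.
rng = np.random.default_rng(1)
hours = np.arange(24 * 30)  # 30 days of hourly readings
daily_cycle = 5 * np.sin(2 * np.pi * hours / 24)
series = pd.Series(daily_cycle + rng.normal(0, 0.5, size=hours.size))

# Seasonal differencing (D=1, s=24): y[t] - y[t-24].
seasonal_diff = series.diff(24).dropna()

# The daily cycle dominates the spread of the original series; after
# seasonal differencing only differenced noise remains.
print(f"Std before: {series.std():.2f}, after: {seasonal_diff.std():.2f}")

# Correlation with the value 24 hours earlier, before and after differencing.
print(f"Lag-24 autocorrelation before: {series.autocorr(lag=24):.2f}")
print(f"Lag-24 autocorrelation after:  {seasonal_diff.autocorr(lag=24):.2f}")
```

A strong positive correlation at lag 24 in the raw series is the kind of evidence that justifies choosing s = 24 in the first place.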

Here’s the solution code:

Python 3.10.4

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import adfuller

def implement_sarima(sensor_data, order=(1,1,1), seasonal_order=(1,1,1,24)):
    """
    Implement Seasonal ARIMA model for manufacturing sensor data
    
    Parameters:
    sensor_data (pd.Series): Time series of sensor readings
    order (tuple): ARIMA order (p,d,q)
    seasonal_order (tuple): Seasonal order (P,D,Q,s)
    
    Returns:
    tuple: (model, predictions, mse)
    """
    # Fit SARIMA model
    model = SARIMAX(sensor_data, order=order, seasonal_order=seasonal_order, enforce_stationarity=False, enforce_invertibility=False)
    fitted = model.fit(disp=False)
    
    # Make predictions
    predictions = fitted.predict(start=len(sensor_data)-10, end=len(sensor_data)+10)
    
    # Calculate MSE
    mse = np.mean((predictions[:10] - sensor_data[-10:])**2)
    
    return fitted, predictions, mse

def initialize_seasonal_data():
    """Initialize sample seasonal manufacturing sensor data"""
    np.random.seed(42)
    dates = pd.date_range(start='2024-01-01', periods=1000, freq='H')
    # Base signal with daily seasonality
    base_signal = np.sin(np.linspace(0, 100, 1000)) * 10
    daily_pattern = np.sin(np.linspace(0, 2*np.pi*41.67, 1000)) * 5  # 41.67 cycles for 1000 hours
    noise = np.random.normal(0, 1, 1000)
    sensor_data = pd.Series(base_signal + daily_pattern + noise, index=dates)
    return sensor_data

# Example usage
seasonal_data = initialize_seasonal_data()
seasonal_model, seasonal_predictions, seasonal_mse = implement_sarima(seasonal_data)
print(f"SARIMA MSE: {seasonal_mse:.2f}")

Key components of this implementation:

Line 19: Create a SARIMA model instance using both the regular order and seasonal_order parameters. The seasonal_order=(P,D,Q,s) captures repeating seasonal patterns, in this case hourly data with daily (24-hour) seasonality. We disable the stationarity and invertibility constraints to allow the model more flexibility during fitting.
Line 20: Fit the model to the input sensor data using .fit() and suppress output with disp=False for a cleaner run. This estimates the model coefficients by maximum likelihood.
Line 23: Use .predict() to generate values from 10 time steps before the end of the dataset through index len(sensor_data)+10. The first 10 values are in-sample, one-step-ahead predictions; the remaining 11 are out-of-sample forecasts.
Line 26: Compute the mean squared error (MSE) for the 10 in-sample predictions by comparing them to the actual values from the sensor data.
Line 28: Return the fitted model, full prediction series, and the MSE for evaluation.

Handling outliers in sensor data

Sensor data frequently contains anomalies due to measurement errors, equipment malfunctions, or genuine process deviations. How will you handle outliers in sensor data?

Sample answer

Let’s explore some key components of handling outliers in sensor data. We will need:

Rolling stats: Use a rolling window (e.g., 24 hours) to calculate mean and standard deviation.
Z-score detection: Flag values whose absolute z-score exceeds a threshold (commonly 3, i.e., three standard deviations from the local mean).
Replacement strategy: Replace outliers with the local rolling mean.
Reporting: Return the cleaned data, index of outliers, and replaced values for validation.

Let’s look at a sample snippet that allows us to demonstrate this using numpy and pandas.

Python 3.10.4

import numpy as np
import pandas as pd


def handle_outliers(sensor_data, threshold=3):
    """
    Handle outliers in manufacturing sensor data using rolling z-scores
    
    Parameters:
    sensor_data (pd.Series): Time series of sensor readings
    threshold (float): Z-score threshold for outlier detection
    
    Returns:
    tuple: (cleaned_data, outlier_indices, replaced_values)
    """
    # Calculate rolling statistics
    rolling_mean = sensor_data.rolling(window=24, center=True).mean()
    rolling_std = sensor_data.rolling(window=24, center=True).std()
    
    # Calculate z-scores
    z_scores = np.abs((sensor_data - rolling_mean) / rolling_std)
    
    # Identify outliers
    outlier_indices = z_scores > threshold
    
    # Replace outliers with rolling mean
    cleaned_data = sensor_data.copy()
    cleaned_data[outlier_indices] = rolling_mean[outlier_indices]
    
    # Store replaced values for verification
    replaced_values = sensor_data[outlier_indices]
    
    return cleaned_data, outlier_indices, replaced_values

def initialize_outlier_data():
    """Initialize sample manufacturing sensor data with outliers"""
    np.random.seed(42)
    dates = pd.date_range(start='2024-01-01', periods=1000, freq='H')
    base_signal = np.sin(np.linspace(0, 100, 1000)) * 10
    noise = np.random.normal(0, 1, 1000)
    
    # Add artificial outliers
    outlier_indices = np.random.choice(1000, 50, replace=False)
    noise[outlier_indices] *= 10
    
    sensor_data = pd.Series(base_signal + noise, index=dates)
    return sensor_data

# Example usage for Question 3
outlier_data = initialize_outlier_data()
cleaned_data, outliers, replaced = handle_outliers(outlier_data)
print(f"Number of outliers detected: {outliers.sum()}")

Let’s look at it in detail:

handle_outliers(sensor_data, threshold=3): This function detects and corrects outliers in a time series of manufacturing sensor data using rolling statistics and z-score analysis.

Lines 17–18: Compute the rolling mean and standard deviation using a 24-hour window (assuming hourly data) with center=True to align the window around each point.
Line 21: Calculate z-scores for each data point, i.e., how many standard deviations away each value is from the local mean.
Line 24: Mark data points as outliers if their absolute z-score is greater than the specified threshold. The default threshold is three, which aligns with the “3-sigma rule” often used in statistical quality control.
Lines 27–28: Create a copy of the original data and replace outlier values with the corresponding values from the rolling mean.
Line 31: Store the original outlier values that were replaced.
Line 33: Return the cleaned data, a boolean mask of outlier positions, and a Series of the original values that were replaced.
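
One detail of the centered window is worth checking by hand. With window=24 and center=True, pandas returns NaN for points that lack a full window, so their z-scores are NaN and those points are never flagged. The sketch below shows this on a toy series; passing min_periods is one possible remedy and is an assumption here, not part of the solution above.

```python
import numpy as np
import pandas as pd

# Toy series: 100 readings (the values themselves are arbitrary).
series = pd.Series(np.arange(100, dtype=float))

# Default behaviour: a full 24-point window is required.
strict_mean = series.rolling(window=24, center=True).mean()

# Assumed remedy: allow partial windows of at least 12 points at the edges.
relaxed_mean = series.rolling(window=24, center=True, min_periods=12).mean()

print(f"NaNs with full windows only: {strict_mean.isna().sum()}")
print(f"NaNs with min_periods=12:    {relaxed_mean.isna().sum()}")
```

Partial windows give noisier statistics near the edges, so whether to use them depends on how important the first and last hours of data are.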

initialize_outlier_data(): This function simulates a noisy sensor dataset with injected outliers for testing purposes.

Lines 37–40: Create a sine-wave-based signal simulating a manufacturing process. Add standard Gaussian noise for realism.
Lines 43–44: Randomly select 50 time steps and amplify their noise by a factor of 10 to create artificial outliers. This step simulates sensor glitches or anomalies.
Line 46: Combine the base signal and noise into a pandas Series with hourly timestamps as the index.
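
Large outliers inflate the rolling mean and standard deviation they are judged against, which can hide them. A common alternative is to use the rolling median and the median absolute deviation (MAD) instead. The sketch below is a swapped-in variant, not the method shown above; the function name handle_outliers_robust, the 3.5 threshold, the 0.6745 scaling constant (which makes the MAD comparable to a standard deviation for normally distributed data), and the min_periods choice are all assumptions.

```python
import numpy as np
import pandas as pd

def handle_outliers_robust(sensor_data, window=24, threshold=3.5):
    """Flag outliers with a rolling median/MAD modified z-score (illustrative variant)."""
    roll = sensor_data.rolling(window=window, center=True, min_periods=window // 2)
    rolling_median = roll.median()
    # MAD: median of absolute deviations from the window's own median.
    rolling_mad = roll.apply(lambda w: np.median(np.abs(w - np.median(w))), raw=True)
    # 0.6745 rescales the MAD to be comparable to a standard deviation.
    modified_z = 0.6745 * (sensor_data - rolling_median).abs() / rolling_mad
    outlier_mask = modified_z > threshold
    # Replace flagged points with the local median.
    cleaned = sensor_data.copy()
    cleaned[outlier_mask] = rolling_median[outlier_mask]
    return cleaned, outlier_mask

# Assumption: a flat synthetic signal with three injected spikes.
rng = np.random.default_rng(0)
data = pd.Series(rng.normal(0, 1, 500))
spike_positions = [50, 200, 400]
data.iloc[spike_positions] += 25

cleaned, mask = handle_outliers_robust(data)
print(f"Flagged {int(mask.sum())} points; "
      f"spikes flagged: {mask.iloc[spike_positions].tolist()}")
```

The median-based statistics barely move when a single spike enters the window, which is why the spikes stand out clearly. The trade-off is that the rolling MAD is slower to compute than a rolling standard deviation.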