Deep Learning for Network Traffic Prediction

Overview

Time series forecasting uses models fitted on historical data to predict future values of an observation. Network traffic prediction is a form of time series forecasting that allows computer network operators to manage their networks more efficiently. Accurate traffic prediction can significantly improve a network’s performance in areas such as congestion management, resource distribution and network volume alerts. Most neural network models find it difficult to learn the long-range temporal relationships in a dataset, primarily because time-series data is non-stationary and non-linear. However, existing literature suggests that Long Short-Term Memory (LSTM) models can capture both the long-term and short-term trends in network traffic data.

This paper implements three LSTM models for network traffic prediction on the South African National Research and Education Network (SANREN). SANREN is a country-wide network of education and research institutions in South Africa. It is a high-speed network for science, education, research and innovation-based institutions, and has been phased in across the country since 2007. As a large, federated network, SANREN could benefit from an LSTM model that allows for preemptive network actions.

The Problem

Modern computer networks are facing the challenges associated with transferring unprecedented amounts of data. Furthermore, large computer networks exhibit volatile network traffic flows. Historically, network operators have attempted to use statistical learning methods to optimise their networks. However, as network flows increase, network operators require more efficient and accurate time-series processing methods to overcome network congestion, reduce access times, allocate bandwidth effectively and detect traffic anomalies.

It is important to consider the constraints and resources of a network when evaluating a candidate model for network traffic prediction. Both computational complexity and run time can be limiting factors for less-resourced networks such as SANREN, so different network traffic prediction models may be better suited to networks of this class. LSTM derivative models, such as the Stacked LSTM and Bidirectional LSTM, have been shown to outperform baseline, vanilla LSTMs. Whilst the predictive performance of an LSTM is the primary focus of this project, the computational feasibility of LSTM and LSTM-derivative prediction models is also investigated.

The project set out to answer three research questions:

How does the SANREN traffic data vary with time and day in relation to the South African university calendar?
What is the computational cost of different LSTM architectures, given a required level of accuracy in predicting future traffic flows on the SANREN?
Which of the LSTM models, Bidirectional, Simple or Stacked, provides the highest accuracy when predicting future SANREN traffic data, subject to network constraints?

Individual Work

The data pipeline was split up so that work could be parallelized. The Bidirectional, Simple and Stacked LSTM models were all developed in Python using Keras, whilst the preprocessing and data analysis were done using pandas and NumPy. In this project, the Simple LSTM served as a benchmark against which to compare our Bidirectional and Stacked LSTMs.

Preprocessing and Stacked LSTM

Antony Fleischer

Antony designed and implemented the preprocessing and analysis sections of the data pipeline. He also implemented and evaluated the stacked LSTM model.

Simple and Bidirectional LSTM

Justin Myerson

Justin implemented and evaluated the Simple and Bidirectional LSTM models.

After the models had been implemented individually, the optimised models were evaluated against one another to determine which model would be applicable to the SANREN use-case.

Model Comparison

Antony Fleischer and Justin Myerson

Before searching for the optimal models across the hyperparameter space, it is important to define how each model will be evaluated. Mean Absolute Error (MAE) and Mean Squared Error (MSE) both measure how close a model’s predicted values are to the actual observed values. In addition, the Coefficient of Determination, R², is used to evaluate the models. It shows the proportion of the variance in the response that can be explained by the independent features. In this case, it shows the proportion of the variation in Bytes that can be explained by the numerical features in the dataset. The training time of each LSTM was also used as a proxy for complexity, with the results of each model variation shown on the left.
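As a rough illustration of these three metrics, the sketch below computes MAE, MSE and R² in plain Python. It is not the project's evaluation code: the function names and the small sample of byte counts are hypothetical and exist only to make the definitions concrete.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between observed and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average squared gap; penalises large misses more heavily than MAE."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical byte counts, invented purely for illustration.
actual_bytes = [100.0, 200.0, 300.0, 400.0]
predicted_bytes = [110.0, 190.0, 310.0, 390.0]

mae = mean_absolute_error(actual_bytes, predicted_bytes)
mse = mean_squared_error(actual_bytes, predicted_bytes)
r2 = r_squared(actual_bytes, predicted_bytes)
```

Lower MAE and MSE indicate predictions closer to the observed values, while an R² closer to 1 indicates that more of the variation in the response is explained.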

The hyperparameter search allowed each model to be optimised independently. Based on performance on the training and validation sets, the optimised models are a 100-neuron Simple LSTM trained over 125 epochs, a 50-neuron Bidirectional LSTM trained over 150 epochs, and a 50-neuron Stacked LSTM trained over 100 epochs. The prediction results of these optimised models on an unseen test set are provided on the left.
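For readers unfamiliar with the three architectures, the following is a minimal Keras sketch of how they typically differ. It is not the authors' implementation. The neuron counts follow the figures quoted above, but the input window length, the number of features, the two-layer depth of the stacked model, the Adam optimiser and the MSE loss are all assumptions made for illustration.

```python
from tensorflow import keras


def build_simple_lstm(timesteps, n_features, units=100):
    """One LSTM layer feeding a single-value regression output."""
    model = keras.Sequential([
        keras.Input(shape=(timesteps, n_features)),
        keras.layers.LSTM(units),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model


def build_bidirectional_lstm(timesteps, n_features, units=50):
    """One LSTM run over the window in both directions, outputs concatenated."""
    model = keras.Sequential([
        keras.Input(shape=(timesteps, n_features)),
        keras.layers.Bidirectional(keras.layers.LSTM(units)),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model


def build_stacked_lstm(timesteps, n_features, units=50):
    """Two LSTM layers (assumed depth); the first returns the full sequence
    so that the second LSTM receives one vector per time step."""
    model = keras.Sequential([
        keras.Input(shape=(timesteps, n_features)),
        keras.layers.LSTM(units, return_sequences=True),
        keras.layers.LSTM(units),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model
```

Each builder would then be trained with `model.fit(...)` for the corresponding number of epochs. The extra recurrent layer in the stacked variant is one plausible reason for a longer training time.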

What We Found

The Results Won't Surprise You

The Stacked LSTM is the most accurate predictor of future traffic flows on the SANREN. It had the best test MSE by 40% and the highest test R² by 23%. However, it also takes the longest to train, at nearly double the training time of the Simple and Bidirectional models.
- A graphical representation of the Stacked LSTM's model performance can be seen on the right.
- If the Stacked LSTM’s training time, which is nearly double that of both simpler models, is prohibitive, then a network provider may decide that the predictive performance of the Simple LSTM model is sufficient.

Prediction accuracy for all models increased with sample size, at the cost of increased training time. Prediction time, on the other hand, was consistently low in our testing.

The Bidirectional LSTM is the least accurate predictor and does not justify its additional complexity over the Simple LSTM.

The preliminary statistical analysis also determined that there is no obvious pattern linking SANREN network traffic flows with either the university calendar or the day of the week.

A deeper dive into the mechanics of LSTM models, the implementation of the models and a discussion of the results can be found in the resources below.