Multivariate Time Series Characterization and Forecasting of VoIP Traffic in Real Mobile Networks
- Submitted by: Mario Di Mauro
- Last updated: Sun, 07/16/2023 - 04:43
- DOI: 10.21227/jef5-4w68
Abstract
Predicting the behavior of real-time traffic (e.g., VoIP) in mobility scenarios can help operators plan their network infrastructures and optimize resource allocation. Accordingly, we propose a forecasting analysis of crucial QoS/QoE descriptors of VoIP traffic (some of which are neglected in the technical literature) in a real mobile environment. Please refer to our paper published in IEEE Transactions on Network and Service Management (https://ieeexplore.ieee.org/document/10184084), also available on arXiv at: https://arxiv.org/pdf/2307.06645
We release an original real-world dataset for multivariate time series prediction, which can be performed with both statistical techniques (e.g., VAR) and Machine/Deep Learning (ML/DL) techniques. The dataset contains several features of cellular traffic organized into time series. The goal is to exploit statistical and learning-based techniques to predict the future behavior of a given feature.
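As a rough illustration of the statistical route, a first-order VAR model can be fitted by ordinary least squares. The sketch below is not the routine used in the paper: the function names `fit_var1` and `forecast_var1` are hypothetical, the model order is fixed to 1 for brevity, and NumPy is assumed to be available.

```python
import numpy as np

def fit_var1(series):
    """Fit x[t] = c + A @ x[t-1] by ordinary least squares.

    series: array of shape (T, k), one column per feature.
    Returns (c, A) with shapes (k,) and (k, k).
    """
    series = np.asarray(series, dtype=float)
    past = series[:-1]      # x[t-1]
    future = series[1:]     # x[t]
    # Prepend a column of ones so the intercept is estimated jointly.
    design = np.hstack([np.ones((past.shape[0], 1)), past])
    coef, *_ = np.linalg.lstsq(design, future, rcond=None)
    c = coef[0]
    A = coef[1:].T
    return c, A

def forecast_var1(c, A, last, steps=1):
    """Iterate the fitted recursion `steps` times from the last observation."""
    x = np.asarray(last, dtype=float)
    out = []
    for _ in range(steps):
        x = c + A @ x
        out.append(x)
    return np.array(out)
```

With the six features of a sub-dataset stacked as columns, `fit_var1` would return a 6x6 coefficient matrix, and `forecast_var1(c, A, series[-1], steps=n)` would produce the next `n` samples of every feature at once. A higher model order or a dedicated library may be more appropriate in practice.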
The equipment we used to build the real-world dataset includes:
- 1 cellular device equipped with Linphone (an open-source softphone supporting the RTCP-XR protocol), representing User Equipment 1 (UE1);
- 1 standard PC equipped with: i) the Linphone softphone, representing User Equipment 2 (UE2); ii) the Wireshark software probe, used to capture the network traffic between UE1 and UE2 and to save it in .pcap format.
The dataset contains network traffic gathered in a real cellular environment around the city of Salerno (Italy), which is classified as a medium-density city (around 2000 people/km^2). As of Mar. 2023, this territory is served by approximately 100 radio towers supporting a mix of LTE/LTE-Advanced (about 97%) and 5G-NSA (about 3%) technologies (data gathered from https://www.nperf.com/en/map/IT/).
We provide both:
- raw data (.pcap) available at: https://drive.google.com/file/d/1-r2Xd1VK6r7O_1KaXVYPus1Rcj6TF9DF/view?u...
- processed data (.txt) available at: https://github.com/mariodim/ml_mobile_dataset/blob/main/ML_TimeSeries_DA...
The whole dataset is split into 16 sub-datasets, one per combination of codec and network scenario:
- 8 codecs: G.722, G.729, GSM, G.711, Mpeg4-16, OPUS, Speex-8, Speex-16.
- 2 network scenarios:
  - Mobile (UE1 communicates with UE2 from a moving car at an average speed of 60 km/h);
  - Fixed (UE1 communicates with UE2 while staying in a fixed location).
Please note that, due to space constraints, our paper analyzes the mobile scenario with a subset of codecs.
Each sub-dataset is the result of a post-processing stage on the raw .pcap files produced by Wireshark.
Each sub-dataset contains 6 temporal features organized in columns (the first column is the time reference):
- MOS (Mean Opinion Score) --> it measures the call quality (expressed as a dimensionless value between 1 and 5);
- BW (Bandwidth) --> it measures the bandwidth consumed by a voice call (expressed in kb/s);
- RTT (Round Trip Time) --> it measures the time elapsed between sending a packet and receiving the corresponding reply (expressed in ms);
- JTR (Jitter) --> it measures the inter-packet jitter (expressed in ms);
- DJB (De-jittering Buffer) --> it measures the buffer length used to reduce jitter (expressed in ms);
- SNR (Signal-to-Noise ratio) --> it measures the objective quality of the communication channel (expressed in dB).
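A minimal sketch of reading one sub-dataset into named columns is given below. It assumes that the .txt file is whitespace-delimited, has one sample per line and no header row, and stores the seven columns in the order listed above (time reference first). These layout details, the `COLUMNS` list and the `parse_subdataset` helper are assumptions made for illustration and should be checked against the actual files.

```python
COLUMNS = ["time", "MOS", "BW", "RTT", "JTR", "DJB", "SNR"]

def parse_subdataset(text, delimiter=None):
    """Parse the text of a sub-dataset file into {column name: list of values}.

    delimiter=None splits on any run of whitespace (assumed layout);
    pass another separator, such as a comma, if the files use one.
    The time reference is kept as a raw string because its format is
    not specified here; the six features are converted to float.
    """
    table = {name: [] for name in COLUMNS}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split(delimiter)
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"line {line_no}: expected {len(COLUMNS)} fields, got {len(fields)}"
            )
        table["time"].append(fields[0])
        for name, value in zip(COLUMNS[1:], fields[1:]):
            table[name].append(float(value))
    return table
```

For example, after reading a file's content into a string `text`, `parse_subdataset(text)["MOS"]` would give the MOS series as a list of floats, provided the layout assumptions hold.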
We have developed a Python routine that performs multivariate time series prediction of the features using different techniques.
The routine is available at the following link: https://colab.research.google.com/drive/1pe-p8yEP8QaVgWcOpVZ2ZwweJEqAjHh...
Please note that you have to upload the chosen sub-dataset to the same Google Colab Notebook directory that contains the routine.
After uploading a sub-dataset (e.g. mob_g722.txt, meaning that the traffic is collected within the mobile scenario and the codec used is G.722), set the parameters in the first "cell" of the Python code:
- filename --> insert the name of the uploaded file (e.g. "mob_g722.txt")
- methods --> choose among the implemented techniques for time series prediction by setting each one to True or False
- param --> hyperparameters of your ML network (number of dense neurons, number of units, epochs, etc.)
- perc_train --> percentage of the data used for training (the test size is set accordingly)
- n_past --> number of past values used in the training set
- n_fut --> number of future samples to be predicted (default = 1)
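To make the roles of perc_train, n_past and n_fut concrete, the following sketch shows one common way to build supervised samples from a multivariate series with a chronological split and a sliding window. It is not taken from the Colab routine: the helper names `split_train_test` and `make_windows` are hypothetical, and perc_train is assumed here to be a value between 0 and 100.

```python
import numpy as np

def split_train_test(series, perc_train):
    """Chronological split: the first perc_train percent goes to training."""
    series = np.asarray(series, dtype=float)
    n_train = int(len(series) * perc_train / 100)
    return series[:n_train], series[n_train:]

def make_windows(series, n_past, n_fut=1):
    """Turn a (T, k) series into (X, y) pairs.

    X[i] holds n_past consecutive rows; y[i] holds the n_fut rows
    that immediately follow them.
    """
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - n_past - n_fut + 1
    if n_samples <= 0:
        raise ValueError("series too short for the requested window sizes")
    X = np.stack([series[i:i + n_past] for i in range(n_samples)])
    y = np.stack([series[i + n_past:i + n_past + n_fut] for i in range(n_samples)])
    return X, y
```

With a series of T rows, this yields T - n_past - n_fut + 1 samples, so larger windows leave fewer training examples. The split is done before windowing so that no test rows leak into the training windows.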
Output files include:
- TXT files containing time series predictions per technique --> e.g. the output file mob_g722_cnn.txt is a 12-column file in this format: column #1 contains original values of MOS, column #2 contains predicted values of MOS, column #3 contains original values of BW, column #4 contains predicted values of BW, and so forth. Once exported, these files can be used to reproduce the plots with any plotting tool;
- RMSE, MAE, and MAPE values for each technique;
- Information about the training time of each technique (shown directly in the code output).
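The error metrics can also be recomputed from an exported prediction file. The sketch below assumes a whitespace-delimited file without a header whose 12 columns alternate original and predicted values, in the feature order MOS, BW, RTT, JTR, DJB, SNR (the order beyond BW is inferred from the feature list and is not stated explicitly). The function names are illustrative, and MAPE is reported as a percentage, which may differ from the convention used by the routine.

```python
import math

FEATURES = ["MOS", "BW", "RTT", "JTR", "DJB", "SNR"]  # assumed column order

def error_metrics(actual, predicted):
    """Return RMSE, MAE and MAPE (in percent) for two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [a - p for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    # MAPE is undefined where the actual value is zero; skip those samples.
    ratios = [abs(e / a) for e, a in zip(errors, actual) if a != 0]
    mape = 100 * sum(ratios) / len(ratios) if ratios else float("nan")
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape}

def metrics_from_output(text):
    """Compute per-feature metrics from the text of a 12-column output file."""
    rows = [
        [float(v) for v in line.split()]
        for line in text.splitlines()
        if line.strip()
    ]
    results = {}
    for i, name in enumerate(FEATURES):
        actual = [row[2 * i] for row in rows]          # odd columns: original
        predicted = [row[2 * i + 1] for row in rows]   # even columns: predicted
        results[name] = error_metrics(actual, predicted)
    return results
```

Comparing these values with the ones printed by the routine is a quick way to confirm that an exported file has been read with the right column order.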