Big Data Machine Learning Benchmark on Spark

Citation Author(s):
Federal University of Pernambuco
Federal University of Pernambuco
Submitted by:
Jairson Rodrigues
Last updated:
Thu, 06/06/2019 - 13:58
Data Format:
Creative Commons Attribution


We introduce a benchmark of distributed algorithm execution over big data. The datasets contain metrics on the computational impact (resource usage) of eleven well-known machine learning techniques on a real computational cluster regarding system resource agnostic indicators: CPU consumption, memory usage, operating system process load, network traffic, and I/O operations. The metrics were collected every five seconds for each algorithm on five different data volume scales, totaling 275 distinct datasets (eleven algorithms, five resources, and five scales). The tested scenarios covered regression, clustering, classification, dimensionality reduction, and collaborative filtering problems. We performed the experiments on 2.15 TB of synthetic data produced with Intel HiBench, on a cluster of 128 cores and 848 GB of RAM managed by the Apache Spark framework. We hope these datasets can be used by the scientific community to obtain insights about running algorithms on big data processing platforms.


The sections below describe the specification of the cluster of machines, the data used to run the distributed machine learning (DML) algorithms, the structure of the gathered metrics (logs), and the methods applied to collect the execution logs.


Datasets Structure

Each of the 275 datasets corresponds to one execution of one DML algorithm at one specific data volume (scale), measured for one resource. To keep the data properly separated, the filenames follow the pattern <dml_algorithm>_<resource>_<scale>.csv, where:

  • <dml_algorithm> stands for one of the eleven executed techniques (see the section about DML algorithms, below)
  • <resource> stands for disk, CPU, memory, processes, or network (see the section about metrics, below)
  • <scale> stands for b1, b2, gigantic, huge, or large (see the section about data volumes, below)


This way, a file named als_cpu_huge.csv holds the CPU load metrics of the Alternating Least Squares algorithm when applied to a problem at the "huge" scale. Likewise, a file named kpar_mem_B1.csv stores the memory usage metrics of K-means Parallel (Dense k-means), and so on.
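The naming convention above can be parsed mechanically. The following Python sketch splits a filename into its three parts. The function name parse_dataset_filename and the decision to lowercase the scale (so that "B1" and "b1" are treated alike) are illustrative assumptions, not part of the dataset.

```python
from typing import NamedTuple


class DatasetName(NamedTuple):
    algorithm: str
    resource: str
    scale: str


# Scale tokens as listed in the dataset description.
KNOWN_SCALES = {"large", "huge", "gigantic", "b1", "b2"}


def parse_dataset_filename(filename: str) -> DatasetName:
    """Split '<dml_algorithm>_<resource>_<scale>.csv' into its parts."""
    if not filename.lower().endswith(".csv"):
        raise ValueError(f"not a CSV file: {filename}")
    stem = filename[:-4]
    # Split from the right so an algorithm token containing '_' stays whole.
    algorithm, resource, scale = stem.rsplit("_", 2)
    scale = scale.lower()  # assumption: 'B1' and 'b1' denote the same scale
    if scale not in KNOWN_SCALES:
        raise ValueError(f"unknown scale: {scale}")
    return DatasetName(algorithm, resource, scale)
```

For example, parse_dataset_filename("als_cpu_huge.csv") yields the algorithm "als", the resource "cpu", and the scale "huge".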


Computer Cluster Specification

The experiments were hosted on the Google Cloud Dataproc environment. The cluster was composed of eight high-power machines, one master node and seven worker nodes, totaling 128 cores, connected to the same internal network and fully dedicated to executing the machine learning algorithms. All nodes had the same specification, described as follows.

  • operating system: Debian GNU/Linux 8.10, kernel 3.16.51-3+deb8u1
  • CPU: Intel Xeon @ 2.60 GHz 
  • architecture: x86_64, Little Endian
  • cores: 16; 2 threads per core
  • RAM memory: 106 GB
  • HDD storage: 500 GB
  • SSD storage: 375 GB

The total capacity of the cluster was 8 nodes, 128 cores, 3.4 terabytes of storage, and 848 GB of RAM, managed by the Apache Spark 2.2.0 computing framework and the YARN cluster manager (Apache Hadoop 2.8.2).
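The core and RAM totals follow directly from the per-node specification. A minimal Python check, using only the figures listed above (the constant names are illustrative):

```python
# Per-node figures taken from the specification list above.
NODES = 8
CORES_PER_NODE = 16
RAM_GB_PER_NODE = 106

total_cores = NODES * CORES_PER_NODE
total_ram_gb = NODES * RAM_GB_PER_NODE

print(total_cores)   # 128
print(total_ram_gb)  # 848
```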


The Metrics

Five system indicators were analyzed: CPU load, network traffic, memory consumption, disk access, and process load, as follows.


CPU (seven dimensions)

  • moment: the order of the measure over time (1, 2, 3, ...)
  • node: the ID of the worker in the cluster (1 to 7)
  • system: O.S. work 
  • user: algorithm's work
  • iowait: input/output busy wait time
  • softirq: software interrupt request 
  • idle: idle time (CPU not in use)
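Because the metrics were collected every five seconds, the moment column can be converted into elapsed time. The sketch below is a hedged illustration: it assumes that the CSV header names match the dimension names listed above, that moment 1 corresponds to the start of the run, and that the CPU columns are percentages. The sample rows are made up and do not come from the dataset.

```python
import csv
import io

SAMPLING_INTERVAL_S = 5  # metrics were collected every five seconds

# Made-up sample rows for illustration only; not taken from the dataset.
SAMPLE = """moment,node,system,user,iowait,softirq,idle
1,1,2.0,50.0,3.0,0.5,44.5
2,1,1.5,80.0,1.0,0.5,17.0
"""


def cpu_timeline(csv_text: str) -> list[dict]:
    """Return one record per row with elapsed seconds and busy share."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        moment = int(row["moment"])
        records.append({
            "node": int(row["node"]),
            # assumption: moment 1 is the first sample, so it maps to t = 0 s
            "elapsed_s": (moment - 1) * SAMPLING_INTERVAL_S,
            # assumption: columns are percentages, so busy = 100 - idle
            "busy": 100.0 - float(row["idle"]),
        })
    return records
```

To read a real file instead of the sample string, pass the file's text content to cpu_timeline after checking that its header matches these assumptions.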


Memory (six dimensions)

  • moment: the order of the measure over time (1, 2, 3, ...)
  • node: the ID of the worker in the cluster (1 to 7)
  • buffer_cache: memory in cache
  • used (O.S. + algorithm): memory used by DML algorithms and O.S.
  • free: available memory  
  • map: number of map operations


Disk (eight dimensions)

  • moment: the order of the measure over time (1, 2, 3, ...)
  • node: the ID of the worker in the cluster (1 to 7)
  • bytes_read: data read from storage (MB)
  • bytes_write: data written to the storage (MB)
  • io_read: number of I/O read operations
  • io_write: number of I/O write operations
  • time_spent_read: reading time
  • time_spent_write: writing time
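Since each row describes one node at one moment, a cluster-wide view requires summing over nodes. The following Python sketch shows one way to do this for the disk metrics. It assumes that the header names match the dimension names above and that each row reports the amount for that sample rather than a running total; neither is confirmed by the description. The sample rows are made up.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows for illustration only; not taken from the dataset.
SAMPLE = """moment,node,bytes_read,bytes_write
1,1,10.0,2.0
1,2,30.0,4.0
2,1,5.0,1.0
"""


def cluster_io_per_moment(csv_text: str) -> dict[int, dict[str, float]]:
    """Sum bytes_read and bytes_write (MB) over all nodes for each moment."""
    totals = defaultdict(lambda: {"bytes_read": 0.0, "bytes_write": 0.0})
    for row in csv.DictReader(io.StringIO(csv_text)):
        moment = int(row["moment"])
        for column in ("bytes_read", "bytes_write"):
            totals[moment][column] += float(row[column])
    return dict(totals)
```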


Network (five dimensions)

  • moment: the order of the measure over time (1, 2, 3, ...)
  • node: the ID of the worker in the cluster (1 to 7)
  • recv_packets: number of received packets 
  • send_bytes: number of sent bytes
  • send_packets: number of sent packets


Processes (six dimensions)

  • moment: the order of the measure over time (1, 2, 3, ...)
  • node: the ID of the worker in the cluster (1 to 7)
  • load5 **: average load in the last 5 minutes 
  • load10: average load in the last 10 minutes
  • load15: average load in the last 15 minutes
  • proc: number of processes

** The amount of work performed by the system. An idle computer has load 0. Each process increments load by 1.
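Because each process increments the load by 1, a raw load value is easier to interpret relative to the number of cores on the node. The Python sketch below normalizes a load value by the 16 cores per node given in the cluster specification. The function name load_per_core is an illustrative assumption, and whether to divide by cores or by hardware threads is a judgment call.

```python
CORES_PER_NODE = 16  # from the cluster specification


def load_per_core(load: float, cores: int = CORES_PER_NODE) -> float:
    """Normalize a load average by the number of cores on the node."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load / cores
```

A result above 1.0 suggests that more processes were competing for the CPU than the node had cores.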


The DML Algorithms

As stated above, the resource consumption metrics were gathered from the execution of eleven machine learning algorithms. They are:

1) Alternating Least Squares (ALS) - collaborative filtering technique; collaborative filtering was introduced by the developers of Tapestry [1]. It analyzes relationships between users and items to identify possible new associations, using neighborhood-based or matrix factorization methods [2]. The Spark implementation is based on the strategy described in [3].

2) Naive Bayes (NB) - a widely used technique for text classification. It computes the conditional probability distribution of the features and applies Bayes’ theorem [4] for prediction. Spark implements the Multinomial Naive Bayes (used in the experiments) and Bernoulli Naive Bayes approaches [5].

3) Gradient Boosted Trees (GBT) - used for classification or regression, the latter applied in the tests. It builds ensembles by training a sequence of decision trees. The Spark implementation is based on [6,7].

4) Dense k-means (k||) - clustering technique that separates datasets into k partitions. Spark implements a parallel variant of the k-means++ algorithm [8], called k-means|| [9] (k-means parallel or dense k-means).

5) Latent Dirichlet Allocation (LDA) - clustering technique that applies a probabilistic model to collections of discrete data, such as a corpus of documents. It is used in document modeling, text classification, and collaborative filtering [10].

6) Linear Regression (LinR) - used for regression. Its earliest form was the least squares method, published by Legendre (1805) and Gauss (1809). Simple linear regression deals with an equation whose right-hand side has an intercept and one explanatory variable with a slope coefficient [11].

7) Logistic Regression (LogR) - the main reasons for choosing the logistic function to analyze dichotomous output variables are that it is flexible, easy to use, and allows meaningful interpretation [12]. The experiment uses the Spark implementation based on Limited-memory BFGS [13] for classification.

8) Principal Component Analysis (PCA) - dimensionality reduction technique that transforms a complex dataset into a lower-dimensional one, revealing hidden structures in the original dataset [14,15].

9) Random Forests (RF) - a nonparametric statistical method used for regression and classification, based on decision trees and bootstrap [16]. An extensive technical discussion about RFs for Big Data is available in [17].

10) Singular Value Decomposition (SVD) - dimensionality reduction technique proposed by [18]. The Spark implementation is based on matrix optimization techniques on clusters, described in [19].

11) Support Vector Machine (SVM) - a model used for regression and classification (the latter used in the experiments), based on constructing hyperplanes in a high-dimensional space [20]. The Spark implementation uses linear SVM for binary classification.
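The eleven algorithms can be grouped by the kind of problem they address, which is useful when comparing resource usage across problem types. The Python mapping below is a sketch built only from the descriptions above. The keys reuse the abbreviations in this document and are not necessarily the tokens used in the filenames. For Random Forests the text does not say which task was run, so both are kept.

```python
from collections import Counter

# Task for each algorithm, as stated in the list above.
TASK_BY_ALGORITHM = {
    "ALS": "collaborative filtering",
    "NB": "classification",
    "GBT": "regression",
    "k||": "clustering",
    "LDA": "clustering",
    "LinR": "regression",
    "LogR": "classification",
    "PCA": "dimensionality reduction",
    "RF": "regression or classification",  # the text does not say which
    "SVD": "dimensionality reduction",
    "SVM": "classification",
}

# Number of algorithms per task label.
algorithms_per_task = Counter(TASK_BY_ALGORITHM.values())
```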


The Data and the Volume (Scales) Used in the DML Algorithms

We used synthetic data (generated by the Intel HiBench framework) due to the difficulty of (i) fitting the data requirements of each algorithm, (ii) adjusting the volume of data at each scale, and (iii) transferring the data from the source to the cluster's internal network. The scales, in ascending order of size, are: large (L), huge (H), gigantic (G), big data 1 (B1), and big data 2 (B2). Scales B1 and B2 have the same volume with no content variation, whereas the content varies across scales G, H, and L. The size of the data for each experiment is detailed as follows, in megabytes, in the sequence L, H, G, B1, and B2 (same size as B1):

  • NB - 359, 1792, 3594, 71885, 71885
  • LogR - 7629, 22886, 38144, 53402, 53402
  • SVM - 19077, 109875, 149544, 171674, 171674
  • RF - 8, 15258, 22886, 33567, 33567
  • LDA - 245, 653, 1976, 4260, 4260
  • k|| - 3830, 19149, 38308, 229816, 229816
  • GBT - 15, 31, 61, 92, 92
  • LinR - 45783, 114463, 305203, 762993
  • ALS - 115, 688, 1372, 1720, 1720
  • PCA - 31, 183, 229, 257, 257
  • SVD - 61, 191, 275, 374, 374

All metrics in the benchmark datasets are gathered from the execution of the above-specified DML algorithms in each one of these scales.
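The sizes listed above can be cross-checked against the 2.15 TB figure given in the introduction. The Python sketch below sums the L, H, G, and B1 sizes. It rests on two assumptions that the description does not state explicitly: that B2 reuses the B1 volume and is therefore not counted a second time, and that 1 TB is taken as 1024 × 1024 MB.

```python
# Sizes in MB for scales L, H, G, B1, copied from the list above.
# B2 is omitted: it has the same size as B1, and this sketch assumes
# it is not counted again in the total.
SIZES_MB = {
    "NB": (359, 1792, 3594, 71885),
    "LogR": (7629, 22886, 38144, 53402),
    "SVM": (19077, 109875, 149544, 171674),
    "RF": (8, 15258, 22886, 33567),
    "LDA": (245, 653, 1976, 4260),
    "k||": (3830, 19149, 38308, 229816),
    "GBT": (15, 31, 61, 92),
    "LinR": (45783, 114463, 305203, 762993),
    "ALS": (115, 688, 1372, 1720),
    "PCA": (31, 183, 229, 257),
    "SVD": (61, 191, 275, 374),
}

total_mb = sum(sum(sizes) for sizes in SIZES_MB.values())
total_tb = total_mb / 1024 ** 2  # assumption: 1 TB = 1024 * 1024 MB

print(total_mb)            # 2253954
print(round(total_tb, 2))  # 2.15
```

Under these assumptions the total agrees with the 2.15 TB of synthetic data mentioned in the introduction.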



[1] Goldberg D, Nichols D, Oki BM, et al. Using collaborative filtering to weave an information tapestry. Communications of the ACM. 1992;35(12):61–70.

[2] Koren Y, Bell R, Volinsky C. Matrix factorization techniques for recommender systems. Computer. 2009;42(8).

[3] Hu Y, Koren Y, Volinsky C. Collaborative filtering for implicit feedback datasets. In: Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on; IEEE; 2008. p.


[4] Vapnik VN, Vapnik V. Statistical learning theory. Vol. 1. Wiley New York; 1998.

[5] Sanderson M, Christopher D, Manning H, et al. Introduction to information retrieval. Natural Language Engineering. 2010;16(1):100.

[6] Friedman JH. Greedy function approximation: a gradient boosting machine. Annals of statistics. 2001;:1189–1232.

[7] Friedman JH. Stochastic gradient boosting. Computational Statistics and Data Analysis. 2002;38(4):367–378.

[8] Arthur D, Vassilvitskii S. k-means++: The advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms; Society for Industrial and Applied Mathematics; 2007. p. 1027–1035.

[9] Bahmani B, Moseley B, Vattani A, et al. Scalable k-means++. Proceedings of the VLDB Endowment. 2012;5(7):622–633.

[10] Blei DM, Ng AY, Jordan MI. Latent dirichlet allocation. Journal of machine Learning research. 2003;3(Jan):993–1022.

[11] Yan X, Su X. Linear regression analysis: theory and computing. World Scientific; 2009.

[12] Hosmer Jr DW, Lemeshow S, Sturdivant RX. Applied logistic regression. Vol. 398. John Wiley & Sons; 2013.

[13] Spark. Linear Methods - RDD-based API - Logistic Regression; 2017. Available at:

[14] Shlens J. A Tutorial on Principal Component Analysis. Epidemiology. 2005;2(c):223–228.

[15] Jolliffe IT. Principal Component Analysis, Second Edition. Encyclopedia of Statistics in Behavioral Science. 2002;30(3):487.

[16] Breiman L. Random forests. Machine learning. 2001;45(1):5–32.

[17] Genuer R, Poggi JM, Tuleau-Malot C, et al. Random forests for big data. Big Data Research. 2017;9:28–46.

[18] Lehoucq RB, Sorensen DC, Yang C. ARPACK users’ guide: solution of large-scale eigenvalue problems with implicitly restarted Arnoldi methods. Vol. 6. SIAM; 1998.

[19] Bosagh Zadeh R, Meng X, Ulanov A, et al. Matrix Computations and Optimization in Apache Spark. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16; New York, New York, USA. ACM Press; 2016. p. 31–38.

[20] Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995;20(3):273–297.