The name OntoSNAQA combines Social Network Analysis (SNA), People, and Questionnaires (Questions and Answers, QA). The ontology will be kept up to date in this GitHub project and at the URL http://www.jabenitez.com/ontologies/OntoSNAQA.owl. It combines three different domains:

  • People
  • Questionnaires
  • Social Network Analysis terms

The main objective of this ontology is to achieve a fully automated Social Network Analysis.


We introduce a benchmark for the execution of distributed algorithms over big data. The datasets contain metrics on the computational impact (resource usage) of eleven well-known machine learning techniques on a real computational cluster, regarding system resource agnostic indicators: CPU consumption, memory usage, operating system process load, network traffic, and I/O operations. The metrics were collected every five seconds for each algorithm at five different data volume scales, totaling 275 distinct datasets (11 algorithms × 5 resources × 5 scales).


The sections below explain the specification of the cluster of machines, the content of the data used to run the distributed machine learning (DML) algorithms, the structure of the gathered metrics (logs), and the methods applied to collect the execution logs.


Datasets Structure

Each of the 275 datasets corresponds to one execution of one DML algorithm at one specific data volume (scale). To keep the data properly separated, the files are named <dml_algorithm>_<resource>_<scale>.csv, where:

  • <dml_algorithm> stands for one of the eleven executed techniques (see the section about DML algorithms below)
  • <resource> stands for disk, CPU, memory, processes, or network (see the section about metrics below)
  • <scale> stands for b1, b2, gigantic, huge, or large (see the section about data volumes below)


Thus, a file named als_cpu_huge.csv contains the CPU load metrics of the Alternating Least Squares algorithm when applied to a problem at the "huge" scale. Likewise, a file named kpar_mem_B1.csv stores the memory usage metrics of K-means Parallel (dense k-means), and so on.
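The naming convention above can be split back into its three parts with a few lines of Python. This is only a minimal sketch: the function name parse_dataset_filename is hypothetical, and it assumes that none of the three tokens contains an underscore. Because the scale token appears both as b1 and as B1 in this description, the sketch lowercases it.

```python
from pathlib import Path
from typing import NamedTuple


class DatasetName(NamedTuple):
    algorithm: str
    resource: str
    scale: str


def parse_dataset_filename(filename: str) -> DatasetName:
    """Split '<dml_algorithm>_<resource>_<scale>.csv' into its parts.

    Hypothetical helper; assumes no token contains an underscore.
    """
    path = Path(filename)
    if path.suffix.lower() != ".csv":
        raise ValueError(f"not a CSV file: {filename}")
    parts = path.stem.split("_")
    if len(parts) != 3:
        raise ValueError(f"unexpected name layout: {filename}")
    algorithm, resource, scale = parts
    # The description writes the scale both as 'b1' and 'B1', so normalize it.
    return DatasetName(algorithm, resource, scale.lower())


print(parse_dataset_filename("als_cpu_huge.csv"))
print(parse_dataset_filename("kpar_mem_B1.csv"))
```

Grouping the returned tuples by algorithm or by scale then gives a quick inventory of which files are present in a local copy of the benchmark.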


Computer Cluster Specification

The experiments were hosted on the Google Cloud Dataproc environment. The cluster was composed of eight high-performance machines, one master node and seven worker nodes, totaling 128 cores, connected to the same internal network and fully dedicated to the execution of the machine learning algorithms. All nodes had the same specification, described as follows.

  • operating system: Debian GNU/Linux 8.10, kernel 3.16.51-3+deb8u1
  • CPU: Intel Xeon @ 2.60 GHz 
  • architecture: x86_64, Little Endian
  • cores: 16; 2 threads per core
  • RAM memory: 106 GB
  • HDD storage: 500 GB
  • SSD storage: 375 GB

The total capacity of the cluster was 8 nodes, 128 cores, 3.4 terabytes of storage, and 848 GB of RAM, managed by the Apache Spark 2.2.0 computing framework and the YARN cluster manager (Apache Hadoop 2.8.2).
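The core and RAM totals follow directly from the per-node specification, as the short check below illustrates. It is a sketch that uses only the per-node figures listed above; the constant names are hypothetical. The storage total is left out because the description does not say how the 3.4 terabyte figure is composed.

```python
NODES = 8              # one master node plus seven other nodes
CORES_PER_NODE = 16    # from the per-node specification
RAM_GB_PER_NODE = 106  # from the per-node specification

total_cores = NODES * CORES_PER_NODE
total_ram_gb = NODES * RAM_GB_PER_NODE

print(total_cores, total_ram_gb)  # 128 848
```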


The Metrics

Five system indicators were analyzed: CPU load, network traffic, memory consumption, disk access, and process load, as follows.


CPU (seven dimensions)

  • moment: the order of the measure over time (1, 2, 3, ...)
  • node: the ID of the worker in the cluster (1 to 7)
  • system: O.S. work 
  • user: algorithm's work
  • iowait: input/output busy wait time
  • softirq: software interrupt request 
  • idle: time in which the CPU was idle


Memory (six dimensions)

  • moment: the order of the measure over time (1, 2, 3, ...)
  • node: the ID of the worker in the cluster (1 to 7)
  • buffer_cache: memory in cache
  • used (O.S. + algorithm): memory used by DML algorithms and O.S.
  • free: available memory  
  • map: number of map operations


Disk (eight dimensions)

  • moment: the order of the measure over time (1, 2, 3, ...)
  • node: the ID of the worker in the cluster (1 to 7)
  • bytes_read: data read from storage (MB)
  • bytes_write: data written to the storage (MB)
  • io_read: number of I/O read operations
  • io_write: number of I/O write operations
  • time_spent_read: reading time
  • time_spent_write: writing time


Network (five dimensions)

  • moment: the order of the measure over time (1, 2, 3, ...)
  • node: the ID of the worker in the cluster (1 to 7)
  • recv_packets: number of received packets 
  • send_bytes: number of sent bytes
  • send_packets: number of sent packets



Processes (six dimensions)

  • moment: the order of the measure over time (1, 2, 3, ...)
  • node: the ID of the worker in the cluster (1 to 7)
  • load5 **: average load in the last 5 minutes 
  • load10: average load in the last 10 minutes
  • load15: average load in the last 15 minutes
  • proc: number of processes

** The amount of work performed by the system. An idle computer has a load of 0. Each process increments the load by 1.
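Since the metrics were collected every five seconds, the moment index shared by all five metric groups can be turned into an approximate elapsed time. This is a minimal sketch, assuming that the first sample (moment 1) is taken at time zero, which the description does not state explicitly; the function name moment_to_seconds is hypothetical.

```python
SAMPLING_INTERVAL_S = 5  # collection interval stated in the description


def moment_to_seconds(moment: int) -> int:
    """Approximate elapsed seconds for a 1-based moment index.

    Assumes moment 1 corresponds to t = 0 (an assumption, not a documented fact).
    """
    if moment < 1:
        raise ValueError("moment is expected to start at 1")
    return (moment - 1) * SAMPLING_INTERVAL_S


print([moment_to_seconds(m) for m in (1, 2, 3, 13)])  # [0, 5, 10, 60]
```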


The DML Algorithms

As stated above, the consumption metrics were gathered from the execution of eleven machine learning algorithms. They are:

1) Alternating Least Squares (ALS) - a collaborative filtering technique, an approach introduced by the developers of Tapestry [1]. It analyzes relationships between users and items to identify possible new associations, using neighborhood-based or matrix-factorization methods [2]. The Spark implementation is based on the strategy described in [3].

2) Naive Bayes (NB) - a widely used technique for text classification. It computes the conditional probability distribution of the features and applies Bayes' theorem [4] for prediction. Spark implements the Multinomial Naive Bayes (used in the experiments) and Bernoulli Naive Bayes approaches [5].

3) Gradient Boosted Trees (GBT) - used for classification or regression, the latter being applied in the tests. GBTs are ensembles that train a sequence of decision trees. The Spark implementation is based on [6,7].

4) Dense k-means (k||) - a clustering technique that separates datasets into k partitions. Spark implements a parallel variant of the k-means++ algorithm [8], called k-means|| [9] (k-means parallel or dense k-means).

5) Latent Dirichlet Allocation (LDA) - a clustering technique that applies a probabilistic model to collections of discrete data, such as a corpus of documents. It is used in document modeling, text classification, and collaborative filtering [10].

6) Linear Regression (LinR) - used for regression. Its earliest form was the least squares method, published by Legendre (1805) and Gauss (1809). Simple linear regression deals with an equation whose right-hand side has an intercept and one explanatory variable with a slope coefficient [11].

7) Logistic Regression (LogR) - the main reasons for choosing the logistic function to analyze dichotomous output variables are that it is flexible, easy to use, and allows meaningful interpretation [12]. The experiment uses the Spark implementation based on Limited-memory BFGS [13] for classification.

8) Principal Component Analysis (PCA) - a dimensionality reduction technique that transforms a complex dataset into a lower-dimensional one, revealing hidden structures in the original dataset [14,15].

9) Random Forests (RF) - a nonparametric statistical method used for regression and classification, based on decision trees and bootstrap [16]. An extensive technical discussion about RFs for Big Data is available in [17].

10) Singular Value Decomposition (SVD) - dimensionality reduction technique proposed by [18]. The Spark implementation is based on matrix optimization techniques on clusters, described in [19].

11) Support Vector Machine (SVM) - a model used for regression and classification (the latter is used here), based on the construction of hyperplanes in a high-dimensional space [20]. The Spark implementation uses Linear SVM for binary classification.


The Data and the Volume (Scales) Used in the DML Algorithms

We used synthetic data generated by the Intel HiBench framework because of the difficulties in (i) meeting the data requirements of each algorithm, (ii) adjusting the volume of data at each scale, and (iii) transferring the data from the source to the cluster's internal network. The scales, in ascending order of size, are: large (L), huge (H), gigantic (G), big data 1 (B1), and big data 2 (B2). Scales B1 and B2 have the same volume and no content variation, whereas the content varies across scales G, H, and L. The size of the data for each experiment is listed below, in megabytes, in the order L, H, G, B1, and B2 (same size as B1):

  • NB - 359, 1792, 3594, 71885, 71885
  • LogR - 7629, 22886, 38144, 53402, 53402
  • SVM - 19077, 109875, 149544, 171674, 171674
  • RF - 8, 15258, 22886, 33567, 33567
  • LDA - 245, 653, 1976, 4260, 4260
  • k|| - 3830, 19149, 38308, 229816, 229816
  • GBT - 15, 31, 61, 92, 92
  • LinR - 45783, 114463, 305203, 762993
  • ALS - 115, 688, 1372, 1720, 1720
  • PCA - 31, 183, 229, 257, 257
  • SVD - 61, 191, 275, 374, 374

All metrics in the benchmark datasets are gathered from the execution of the above-specified DML algorithms in each one of these scales.
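To make the size list above easier to compare, the values can be placed in a small lookup table. The sketch below copies three of the rows (NB, GBT, and ALS) exactly as listed; the names SCALES, SIZES_MB, and growth_factors are hypothetical, and the growth factor relative to the large scale is only an illustrative derived quantity, not a figure reported with the benchmark.

```python
from typing import Dict

SCALES = ("L", "H", "G", "B1", "B2")

# Sizes in megabytes, copied from the list above (a subset of the rows).
SIZES_MB = {
    "NB": (359, 1792, 3594, 71885, 71885),
    "GBT": (15, 31, 61, 92, 92),
    "ALS": (115, 688, 1372, 1720, 1720),
}


def growth_factors(algorithm: str) -> Dict[str, float]:
    """Size of each scale divided by the size of the large (L) scale."""
    sizes = SIZES_MB[algorithm]
    return {scale: size / sizes[0] for scale, size in zip(SCALES, sizes)}


for name in SIZES_MB:
    rounded = {s: round(f, 1) for s, f in growth_factors(name).items()}
    print(name, rounded)
```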



[1] Goldberg D, Nichols D, Oki BM, et al. Using collaborative filtering to weave an information tapestry. Communications of the ACM. 1992;35(12):61–70.

[2] Koren Y, Bell R, Volinsky C. Matrix factorization techniques for recommender systems. Computer. 2009;42(8).

[3] Hu Y, Koren Y, Volinsky C. Collaborative filtering for implicit feedback datasets. In: Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on; IEEE; 2008. p.


[4] Vapnik VN, Vapnik V. Statistical learning theory. Vol. 1. Wiley New York; 1998.

[5] Sanderson M, Christopher D, Manning H, et al. Introduction to information retrieval. Natural Language Engineering. 2010;16(1):100.

[6] Friedman JH. Greedy function approximation: a gradient boosting machine. Annals of statistics. 2001;:1189–1232.

[7] Friedman JH. Stochastic gradient boosting. Computational Statistics and Data Analysis. 2002;38(4):367–378.

[8] Arthur D, Vassilvitskii S. k-means++: The advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms; Society for Industrial and Applied Mathematics; 2007. p. 1027–1035.

[9] Bahmani B, Moseley B, Vattani A, et al. Scalable k-means++. Proceedings of the VLDB Endowment. 2012;5(7):622–633.

[10] Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. Journal of Machine Learning Research. 2003;3(Jan):993–1022.

[11] Yan X, Su X. Linear regression analysis: theory and computing. World Scientific; 2009.

[12] Hosmer Jr DW, Lemeshow S, Sturdivant RX. Applied logistic regression. Vol. 398. John Wiley & Sons; 2013.

[13] Spark. Linear Methods - RDD-based API - Logistic Regression; 2017. Available at: https://spark.apache.org/docs/2.2.0/mllib-linear-methods.html#logistic-r....

[14] Shlens J. A Tutorial on Principal Component Analysis. Epidemiology. 2005;2(c):223–228.

[15] Jolliffe IT. Principal Component Analysis, Second Edition. Encyclopedia of Statistics in Behavioral Science. 2002;30(3):487.

[16] Breiman L. Random forests. Machine learning. 2001;45(1):5–32.

[17] Genuer R, Poggi JM, Tuleau-Malot C, et al. Random forests for big data. Big Data Research. 2017;9:28–46.

[18] Lehoucq RB, Sorensen DC, Yang C. ARPACK users’ guide: solution of large-scale eigen-value problems with implicitly restarted Arnoldi methods. Vol. 6. Siam; 1998.

[19] Bosagh Zadeh R, Meng X, Ulanov A, et al. Matrix Computations and Optimization in Apache Spark. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16; New York, New York, USA. ACM Press; 2016. p. 31–38.

[20] Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995;20(3):273–297.


SDTwittC consists of 200 authors evenly balanced by gender (100 each). We identified the gender of the tweeters from their names and profile pictures. Because they are potential copy-and-paste texts, both tweets and retweets were discarded from the outset; only replies were compiled. The number of replies per author ranges from hundreds to thousands. Male authors produced 233926 replies, whereas the female group produced 219740 replies.


This dataset was created based on the paper 'Andras Hajdu, Gyorgy Terdik, Attila Tiba, and Henrietta Toman: A stochastic approach to handle knapsack problems in the creation of ensembles'. To summarize our experimental setup for UCI binary classification problems, we have considered the base classifiers perceptron, decision tree, Levenberg-Marquardt feedforward neural network, random neural network, and discriminative restricted Boltzmann machine classifier for the 5 UCI datasets MAGIC Gamma Telescope, HIGGS, EEG EyeState, Musk (Version 2), and Spambase; datasets of large cardinalities were sele




Characteristic Impedance Results of Microstrip Transmission Lines with a 3D EM Simulation Tool


These data were donated by Peyman Mahouti in 2019.

Donor's note:

Please cite the following paper if you use this data set:

[1] Mahouti P, Gunes F, Belen MA, Demirel S. Symbolic Regression for Derivation of an Accurate Analytical Formulation Using Big Data: An Application Example. ACES JOURNAL 2017; 32(5): 574-591.



Supplementary data for the IEEE Access paper Applicability of Immersive Analytics in Mixed Reality: Usability Study


To obtain the prices of parts from their manufacturing characteristics and other manufacturing processes, feature quantity expression is applied. Manufacturing features are identified, their feature quantities are calculated, and the feature quantities are then described as data in the form of assignments. To obtain the prices of parts intelligently, the most widely used and mature deep-learning method is adopted to produce accurate quotations for parts.


This dataset was used in the experiments of the paper "Bus Ridesharing Scheduling Problem". It describes a real-world bus ridesharing scheduling problem in the city of Chengdu, China, and includes 10 depots and 2,000 trips.


This is the dataset used in the experiments of the paper "Bus Pooling: A Large-Scale Bus Ridesharing Service". The dataset contains 60,822,634 trajectory records from 11,922 Shanghai taxis collected on a single day (Apr 1, 2018). The 100 groups of coordinate sets, each containing three coordinates, serve as experimental samples to compare the effectiveness and efficiency of location-allocation algorithms.